# Fast, Branchless Ray/Bounding Box Intersections, Part 3: Boundaries

This post is the third in a series of posts I've written about ray/bounding box intersection tests. However, it supersedes the previous two, and should be self contained, so don't feel like you have to read the others unless you're interested in the historical context. On and I substantially revised this post, improving the implementation and adding a much more thorough performance analysis.

Last time I wrote about ray/box intersections, we left off with a simple and fast implementation of the slab method that handles all the corner cases. To save you the trouble of (re-)reading the last two posts, I'll start with a quick recap. Feel free to skip to the next section if you're already familiar with it.

## Geometry

An axis-aligned bounding box $\square$ can be specified by the coordinates of two of its corners:

A point $(x, y, z)$ is inside the box if and only if

$\begin{alignedat}{3} & x_{\min{}} && < x && < x_{\max{}} \\ & y_{\min{}} && < y && < y_{\max{}} \\ & z_{\min{}} && < z && < z_{\max{}}. \\ \end{alignedat}$

Equivalently, the interior of the box is the intersection of many half-spaces:

The slab method tests for an intersection between a ray and an AABB (axis-aligned bounding box) by repeatedly clipping the line with these half-spaces. In the end, what's left of the ray is the segment that intersects the box, if any.

A ray is typically defined parametrically by

$\ell(t) = \begin{pmatrix} x_0 \\ y_0 \\ z_0 \\ \end{pmatrix} + t \begin{pmatrix} x_d \\ y_d \\ z_d \\ \end{pmatrix},$

where $(x_0, y_0, z_0)$ is the origin of the ray and $(x_d, y_d, z_d)$ is the ray's direction. $t > 0$ is considered “in front” of the origin, while $t < 0$ is “behind” it. We can find where the ray intersects an axis-aligned plane by solving (for example)

$\ell_x(t) = x_0 + t x_d = x$

for

$t = \frac{x - x_0}{x_d}.$

The segment of the ray that intersects the $x$ planes of the bounding box is then

$\begin{aligned} t_{\min{}} & = \min \{ t_1, t_2 \} \\ t_{\max{}} & = \max \{ t_1, t_2 \}. \\ \end{aligned}$

The intersection of two segments $(u_{\min{}}, u_{\max{}})$ and $(v_{\min{}}, v_{\max{}})$ is just

$\begin{aligned} t_{\min{}} & = \max \{ u_{\min{}}, v_{\min{}} \} \\ t_{\max{}} & = \min \{ u_{\max{}}, v_{\max{}} \}. \\ \end{aligned}$

If we end up with $t_{\min{}} > t_{\max{}}$, the intersection is empty.

Putting all this together leads to a naturally branch-free ray/AABB intersection test (assuming `min()`

and `max()`

are branch-free and loops are unrolled):

```
/// Parametric representation of a ray.
struct ray {
float origin[3];
float dir[3];
float dir_inv[3];
};
/// An axis-aligned bounding box.
struct box {
float min[3];
float max[3];
};
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int i = 0; i < 3; ++i) {
float t1 = (box->min[i] - ray->origin[i]) * ray->dir_inv[i];
float t2 = (box->max[i] - ray->origin[i]) * ray->dir_inv[i];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
return tmin < tmax;
}
```

We precompute `ray->dir_inv[i] = 1.0 / ray->dir[i]`

to replace division (slow) with multiplication (fast).
You might worry about dividing by zero, but part 1 describes how floating point infinities lead to the right answer even when `ray->dir[i] == 0.0`

.

You might also worry about the multiplication, since in floating point, $0 * \infty \equiv \mathrm{NaN}$. Luckily, part 2 describes a simple fix:

```
- tmin = max(tmin, min(t1, t2));
- tmax = min(tmax, max(t1, t2));
+ tmin = max(tmin, min(min(t1, t2), tmax));
+ tmax = min(tmax, max(max(t1, t2), tmin));
```

```
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int i = 0; i < 3; ++i) {
float t1 = (box->min[i] - ray->origin[i]) * ray->dir_inv[i];
float t2 = (box->max[i] - ray->origin[i]) * ray->dir_inv[i];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
return tmin < tmax;
}
```

## Boundaries

You may have noticed that all the inequalities above are strict ($<$, not $\le$). That means rays which just touch a corner, edge, or face of the bounding box will be considered non-intersecting.

This might not be too big a deal, since rays that exactly touch the boundary of the box are a somewhat degenerate case.
Of course, degenerate doesn't mean *impossible* or even *unlikely*, so it would be nice to include the boundary.
Several comments on the last post agree.

Sadly, relaxing the inequalities fixes the corner case, but not the edge case, since that's the one involving NaN:

```
- return tmin < tmax;
+ return tmin <= tmax;
```

I alluded to this in the previous post:

(Since this is an edge case, you might wonder why we don't choose to return

`true`

instead of`false`

for rays on the boundary. It turns out to be much harder to get this behaviour with efficient code.)

I'm happy to announce that after ~~seven years of grueling research~~ thinking about it for a few minutes yesterday, I finally found an efficient implementation that includes the boundary.
If `min()`

and `max()`

are implemented in the obvious way:

```
static inline float min(float x, float y) {
return x < y ? x : y;
}
static inline float max(float x, float y) {
return x > y ? x : y;
}
```

(which is also typically the most efficient way, since it matches the x86 SSE min/max instructions), then we get some curious semantics around NaN:

$\begin{aligned} \min\{x, \mathrm{NaN}\} & = \max\{x, \mathrm{NaN}\} = \mathrm{NaN} \\ \min\{\mathrm{NaN}, y\} & = \max\{\mathrm{NaN}, y\} = y. \\ \end{aligned}$

We'd like to keep `tmin`

and `tmax`

unchanged if either `t1`

or `t2`

is NaN, so swapping the arguments to keep `tmin`

/`tmax`

on the right seems like a good idea:

```
- tmin = max(tmin, min(t1, t2));
- tmax = min(tmax, max(t1, t2));
+ tmin = max(min(t1, t2), tmin);
+ tmax = min(max(t1, t2), tmax);
```

This works if `t2`

is NaN, but doesn't handle `t1`

.
The last piece of the puzzle is to push `tmin`

/`tmax`

into the inner `min()`

/`max()`

, so that at least one argument is always non-NaN:

```
- tmin = max(min(t1, t2), tmin);
- tmax = min(max(t1, t2), tmax);
+ tmin = min(max(t1, tmin), max(t2, tmin));
+ tmax = max(min(t1, tmax), min(t2, tmax));
```

```
bool intersection(const struct ray *ray, const struct box *box) {
float tmin = 0.0, tmax = INFINITY;
for (int i = 0; i < 3; ++i) {
float t1 = (box->min[i] - ray->origin[i]) * ray->dir_inv[i];
float t2 = (box->max[i] - ray->origin[i]) * ray->dir_inv[i];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
return tmin <= tmax;
}
```

## Batching

With the ray/box intersection API as it currently is, a simple ray tracer might call it like this:

```
struct object *object = NULL;
float t = INFINITY;
for (int i = 0; i < nobjects; ++i) {
if (intersection(ray, objects[i]->bounding_box)) {
if (objects[i]->intersection(ray, &t)) {
object = objects[i];
}
}
}
```

This is suboptimal in a few ways. For starters, we're keeping track of the $t$ value of the closest intersecting object, but the ray/box intersection test is ignoring it. We should pass it in to avoid looking inside bounding boxes that are too far away:

```
-bool intersection(const struct ray *ray, const struct box *box) {
- float tmin = 0.0, tmax = INFINITY;
+bool intersection(const struct ray *ray, const struct box *box, float tmax) {
+ float tmin = 0.0;
...
for (int i = 0; i < nobjects; ++i) {
- if (intersection(ray, objects[i]->bounding_box)) {
+ if (intersection(ray, objects[i]->bounding_box, t)) {
if (objects[i]->intersection(ray, &t)) {
```

Another problem is that we just spent all this effort coming up with a fast, branchless intersection test, and then immediately branched on the result.
We could probably get more throughput by testing a ray against a batch of bounding boxes all at once, before we move onto the more expensive ray/object intersections.
But if we do that, `bool`

(or even `bool[]`

) is probably the wrong return type.
Instead, we should save the $t$ values we calculated:

```
void intersections(
const struct ray *ray,
size_t nboxes,
const struct box boxes[nboxes],
float ts[nboxes])
{
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int j = 0; j < 3; ++j) {
float t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
float t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
ts[i] = tmin <= tmax ? tmin : ts[i];
}
}
```

and use them later to decide which boxes might still intersect the ray segment:

```
// Pseudo-code, don't really use VLAs :)
float ts[nobjects] = {INFINITY, INFINITY, ...};
intersections(ray, nobjects, bounding_boxes, ts);
for (size_t i = 0; i < nobjects; ++i) {
if (ts[i] < t) {
if (objects[i]->intersection(ray, &t)) {
object = objects[i];
}
}
}
```

This API also allows the implementation to be easily vectorized, parallelized, or even offloaded to a GPU.

## Performance

I measured the performance of three variations of this algorithm. The baseline implementation (from part 1) makes no attempt to handle NaN consistently. The "exclusive" implementation (from part 2) consistently treats the boundary of the box as non-intersecting. Finally, the "inclusive" implementation (from this post) includes the boundary as part of the box.

Measurements were taken on an 3.80 GHz AMD Ryzen Threadripper 3960X with 64 GiB of RAM, running Arch Linux.
CPU frequency scaling was disabled, as was Turbo Core (AMD's version of Turbo Boost), so the clock speed was constant throughout the expirement.
The code was compiled by Clang 14.0.6 with the flags `-O3 -flto -march=native`

.
I also tried GCC 12.2.0, but it did not consistently eliminate branches or vectorize the loop, making it much slower.
The hot loop of each configuration looks like:

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int j = 0; j < 3; ++j) {
float t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
float t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
ts[i] = tmin < tmax ? tmin : ts[i];
}
```

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int j = 0; j < 3; ++j) {
float t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
float t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
ts[i] = tmin < tmax ? tmin : ts[i];
}
```

```
for (size_t i = 0; i < nboxes; ++i) {
const struct box *box = &boxes[i];
float tmin = 0.0, tmax = ts[i];
for (int j = 0; j < 3; ++j) {
float t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
float t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
ts[i] = tmin <= tmax ? tmin : ts[i];
}
```

The benchmark tested intersections between a ray and a complete octree of various depths. Approximately 10 billion intersection tests were performed for each configuration. The observed throughput, measured in billions of intersections per second, is shown in the graph below.

## Vectorization

Clang managed to automatically vectorize the implementation and achieve impressive performance.
However, humans can often still beat the compiler at vectorization, so let's try it.
My CPU supports AVX2, which gives us 256-bit registers that hold 8 `float`

s:

```
#include <xmmintrin.h>
#include <immintrin.h>
/// A vector of 8 floats.
typedef __m256 vfloat;
/// Broadcast a float across all elements of a vector.
#define broadcast(x) _mm256_set1_ps(x)
/// Elementwise vector minimum.
#define min(x, y) _mm256_min_ps(x, y)
/// Elementwise vector maximum.
#define max(x, y) _mm256_max_ps(x, y)
/// A vectorized ray, holding 8 rays at once.
struct vray {
vfloat origin[3];
vfloat dir_inv[3];
};
/// A vectorized bounding box, holding 8 boxes.
struct vbox {
vfloat min[3];
vfloat max[3];
};
```

The implementation looks almost identical, mostly just replacing single `float`

s with vectors.
Each iteration of the outer loop is now intersecting 8 rays (or 8 copies of the same ray) against 8 bounding boxes.

```
void intersections(
const struct vray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int j = 0; j < 3; ++j) {
vfloat t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
vfloat t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = max(tmin, min(t1, t2));
tmax = min(tmax, max(t1, t2));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin < tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LT_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

```
void intersections(
const struct vray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int j = 0; j < 3; ++j) {
vfloat t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
vfloat t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = max(tmin, min(min(t1, t2), tmax));
tmax = min(tmax, max(max(t1, t2), tmin));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin < tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LT_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

```
void intersections(
const struct vray *ray,
size_t nboxes,
const struct vbox boxes[nboxes],
vfloat ts[nboxes])
{
for (size_t i = 0; i < nboxes; ++i) {
const struct vbox *box = &boxes[i];
vfloat tmin = broadcast(0.0);
vfloat tmax = ts[i];
for (int j = 0; j < 3; ++j) {
vfloat t1 = (box->min[j] - ray->origin[j]) * ray->dir_inv[j];
vfloat t2 = (box->max[j] - ray->origin[j]) * ray->dir_inv[j];
tmin = min(max(t1, tmin), max(t2, tmin));
tmax = max(min(t1, tmax), min(t2, tmax));
}
// Use a comparison mask and blend to vectorize
// ts[i] = tmin <= tmax ? tmin : ts[i];
vfloat mask = _mm256_cmp_ps(tmin, tmax, _CMP_LE_OQ);
ts[i] = _mm256_blendv_ps(ts[i], tmin, mask);
}
}
```

Throughput is significantly improved, at least until memory becomes the bottleneck. We're reaching between 2 and 2.5 billion intersections/s, around 2.6× faster than the un/auto-vectorized implementation.

## Parallelism

All the results so far were single-threaded, but I hope no one is writing single-threaded (CPU) ray tracers these days. Let's see what happens if we crank up the parallelism:

As long as we're not bottlenecked on memory, each added thread contributes significant additional throughput. The best configurations reach over 61 billion intersections/second! It scales quite nicely too:

These benchmarks are on GitHub if you want to try them yourself. The code is MIT licensed, so feel free to adapt it for your own uses.