# Runtime Performance

In the following section, we assess the performance of wrenfold-generated functions.

## Caveats

As with any benchmarking results, there are some important caveats to consider:

- Choice of compiler, compiler version, and CPU architecture may meaningfully alter results.
- Compiler flags (for instance, enabling `-march=native`) can optimize or pessimize a given implementation, sometimes in unexpected ways.
- Performance is a moving target, and results will evolve over time.

For the most accurate outcome, you should test your function in-context with a compiler and CPU architecture that reflect your production environment.

## Overview

The results reported here were computed using the code in the wrenfold-benchmarks repository. We consider the performance of four functions of increasing complexity:

`QuatLocalCoords`

: Computing the tangent-space difference between two orientations parameterized as quaternions. This is sometimes referred to as the *local coordinates* operation:

  \[f(\mathbf{q}_0, \mathbf{q}_1) = \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right)\]

`ImuIntegration`

: Performing a single step of IMU preintegration.
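A single step of this kind can be sketched as follows. This is a simplified Euler-integration sketch in the style of standard on-manifold preintegration; the function names, and the omission of bias and noise terms, are our simplifications rather than the benchmarked implementation:

```python
import numpy as np

def skew(v):
    """3x3 skew-symmetric matrix such that skew(a) @ b == cross(a, b)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix (SO(3) exponential map)."""
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return np.eye(3) + skew(w)  # first-order approximation
    K = skew(w / angle)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def imu_step(dR, dv, dp, w, a, dt):
    """One Euler step of preintegration: rotation delta dR, velocity delta dv,
    position delta dp, gyro measurement w, accelerometer measurement a."""
    dp_new = dp + dv * dt + 0.5 * (dR @ a) * dt ** 2
    dv_new = dv + (dR @ a) * dt
    dR_new = dR @ so3_exp(w * dt)
    return dR_new, dv_new, dp_new
```

Starting from identity deltas with zero angular rate, one step with constant acceleration `a` accumulates `a * dt` of velocity and `0.5 * a * dt**2` of position.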

`QuatInterpolation`

: Given an interpolation fraction \(\alpha\), compute the tangent-space interpolation between two quaternions:

  \[f(\mathbf{q}_0, \mathbf{q}_1, \alpha) = \mathbf{q}_0 \cdot \text{exp}\left( \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right) \cdot \alpha\right)\]
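Numerically, the formula above can be sketched as follows (a self-contained NumPy sketch with a `[w, x, y, z]` convention; helper names are illustrative, not wrenfold's API):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_log(q):
    """Logarithmic map: unit quaternion -> rotation vector."""
    w, v = q[0], q[1:]
    n = np.linalg.norm(v)
    if n < 1e-12:
        return 2.0 * v / w  # small-angle approximation
    return 2.0 * np.arctan2(n, w) * v / n

def quat_exp(w):
    """Exponential map: rotation vector -> unit quaternion."""
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return np.array([1.0, 0.5 * w[0], 0.5 * w[1], 0.5 * w[2]])
    s = np.sin(0.5 * angle) / angle
    return np.array([np.cos(0.5 * angle), s * w[0], s * w[1], s * w[2]])

def quat_interpolate(q0, q1, alpha):
    """q0 * exp(alpha * log(conj(q0) * q1))."""
    delta = quat_log(quat_mul(quat_conj(q0), q1))
    return quat_mul(q0, quat_exp(alpha * delta))
```

At \(\alpha = 0.5\) this yields the halfway rotation between the two inputs.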

`RollingShutterCamera`

: This function projects a Euclidean point into a moving rolling-shutter camera. The camera uses a first-order (constant velocity) motion model, and the OpenCV intrinsic model with radial and tangential distortion coefficients.
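The intrinsic portion of this model (pinhole projection followed by OpenCV-style radial and tangential distortion with coefficients \(k_1, k_2, p_1, p_2\)) can be sketched as below. The rolling-shutter motion model is omitted, and the parameter names are illustrative:

```python
import numpy as np

def project_opencv(p_cam, fx, fy, cx, cy, k1, k2, p1, p2):
    """Project a point in camera coordinates to pixel coordinates using the
    OpenCV intrinsic model (radial terms k1/k2, tangential terms p1/p2)."""
    # Normalize by depth.
    x = p_cam[0] / p_cam[2]
    y = p_cam[1] / p_cam[2]
    r2 = x * x + y * y
    # Radial distortion factor.
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    # Apply radial + tangential distortion.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Apply focal lengths and principal point.
    return np.array([fx * x_d + cx, fy * y_d + cy])
```

With all distortion coefficients set to zero this reduces to the plain pinhole projection.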

These functions are plausible sub-expressions of a visual-inertial odometry (VIO) or visual-inertial navigation system (VINS), which is why they were selected for study. For each function, we also compute the tangent-space Jacobians with respect to the input values.

## Jacobian Computation

When computing Jacobians on manifolds (for example the rotation group \(SO\left(3\right)\)), there are two plausible approaches that are well suited to symbolic code generation:

1. We can compute the Jacobians with respect to the group variables (for instance, the four variables \(\left[w, x, y, z\right]\) that make up a quaternion) and then chain them with the Jacobian of the *retraction* operation (see section `II.D` of [1]) evaluated around zero. We refer to this as the *chain-rule* method, which is detailed in section `B.1` of the SymForce paper.
2. Alternatively, we can first replace the retract operation with a first-order Taylor series in the vector variable \(\delta \mathbf{x}\), substitute the series into the function, and then evaluate the result at \(\delta \mathbf{x} = 0\) after computing Jacobians with respect to \(\delta \mathbf{x}\). This method is detailed in section `B.2` of the SymForce paper. We refer to this as the *first-order* method. This approach can produce fewer operations in certain instances.
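To illustrate the first-order method on a toy one-dimensional manifold (SO(2), where the retraction is additive on the angle), we can use SymPy as a stand-in for a symbolic frontend; this is not wrenfold's actual API:

```python
import sympy as sp

theta, dx = sp.symbols('theta delta_x')

# A scalar function on SO(2). The retraction perturbs the angle additively,
# so f(retract(theta, dx)) = cos(theta + dx).
f = sp.cos(theta + dx)

# Replace the retraction with its first-order Taylor series in dx ...
f_lin = sp.series(f, dx, 0, 2).removeO()  # cos(theta) - sin(theta)*dx

# ... then differentiate with respect to dx and evaluate at dx = 0.
J = sp.diff(f_lin, dx).subs(dx, 0)  # -sin(theta)
```

The resulting tangent-space Jacobian, \(-\sin\theta\), matches what the chain-rule method would produce for this simple example.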

SymForce provides an explicit interface
for manifolds, and defaults to taking derivatives via the first-order method. In order to perform an
apples-to-apples comparison, we apply the *chain-rule* method to both frameworks. There is nothing
to prevent the user from applying the first-order approximation to wrenfold, if they so desire. The
two code-generated implementations we evaluate are:

`XXX_Wrenfold`

: wrenfold symbolic expressions with the *chain-rule* method.

`XXX_SymforceChain`

: SymForce symbolic expressions with the *chain-rule* method. These are directly comparable to the `_Wrenfold` implementations.

We also compare against two further implementations:

`XXX_Handwritten`

: A handwritten implementation. Eigen is used to provide rotation and linear-algebra operations. Jacobians are computed “GTSAM style” by manually chaining together derivatives for each step of the function:

  \[\frac{\partial \mathbf{f}\left(\mathbf{g}\left(\mathbf{x}\right)\right)} {\partial \mathbf{x}} = \frac{\partial \mathbf{f}\left(\mathbf{u}\right)} {\partial \mathbf{u}} \biggr\rvert_{\mathbf{u} = \mathbf{g}\left(\mathbf{x}\right)} \frac{\partial \mathbf{g}\left(\mathbf{x}\right)} {\partial \mathbf{x}}\]

`XXX_Ceres`

: A handwritten implementation that employs Ceres auto-diff to compute Jacobians.

In all likelihood, a reasonably diligent author can produce a handwritten implementation that
surpasses the code-generated equivalent. We argue that the strength of code generation is the
ability to produce a *competitive* implementation in short order, allowing for quicker development
and evaluation in the context of a production system.

## Results

The following results were last updated with wrenfold `v0.0.5`. We collect results for `gcc 12.3.0-3` and `clang 16.0.6`, both using optimization level `-O3` on the `x86_64` architecture. The benchmark code can be found in the wrenfold-benchmarks repository.

We report resulting times as multiples of the wrenfold-generated implementation (for the same compiler). A multiple `> 1.0` indicates that the implementation under comparison is slower than wrenfold, while a multiple `< 1.0` indicates it is faster. Plots are available below; a few observations follow:

- Code-generated functions are *roughly* comparable to the handwritten implementations. For example, `QuatInterpolation` and `ImuIntegration` are ~5-15% faster in handwritten form under gcc [2], and ~5-10% slower than wrenfold under clang.
- When comparing wrenfold to `SymforceChain`, we find that:
  - For the three most complicated functions (`ImuIntegration`, `QuatInterpolation`, and `RollingShutterCamera`), wrenfold implementations are faster than SymForce.
  - For the `QuatInterpolation` test, the SymForce implementations require at least twice the time under both gcc and clang.
- Auto-differentiated Ceres implementations are always slower than their code-generated equivalents, sometimes by multiples as high as 7x or 8x.

From our (evidently biased) perspective, **the primary takeaway** (with regards to performance) is
that code-generated methods are a comparable substitute for hand-rolled implementations. They can be
used to rapidly prototype mathematical functions while incurring a relatively small performance
trade-off.

Footnotes