Runtime Performance

In this section, we assess the runtime performance of wrenfold-generated functions.

Caveats

As with any benchmarking results, there are some important caveats to consider:

  • Choice of compiler, compiler version, and CPU architecture may meaningfully alter results.

  • Compiler flags (for instance, enabling -march=native) can optimize or pessimize a given implementation, sometimes in unexpected ways.

  • Performance is a moving target, and results will evolve over time.

For the most accurate outcome, you should test your function in context, using a compiler and CPU architecture that reflect your production environment.

Overview

The results reported here were computed using the code in the wrenfold-benchmarks repository. We consider the performance of four functions of increasing complexity:

  • QuatLocalCoords: Computing the tangent-space difference between two orientations parameterized as quaternions. This is sometimes referred to as the local coordinates operation (an illustrative Eigen sketch of this and the interpolation operation follows this list):

    \[f(\mathbf{q}_0, \mathbf{q}_1) = \log\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right)\]
  • ImuIntegration: Performing a single step of IMU preintegration.

  • QuatInterpolation: Given an interpolation fraction \(\alpha\), compute the tangent-space interpolation between two quaternions:

    \[f(\mathbf{q}_0, \mathbf{q}_1, \alpha) = \mathbf{q}_0 \cdot \exp\left( \log\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right) \cdot \alpha\right)\]
  • RollingShutterCamera: This function projects a Euclidean point into a moving rolling-shutter camera. The camera uses a first-order (constant velocity) motion model, and the OpenCV intrinsic model with radial and tangential distortion coefficients.

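To make these operations concrete, the following is a minimal handwritten sketch of the local coordinates and interpolation operations using Eigen. The function names are ours for illustration; the actual benchmarked implementations live in the wrenfold-benchmarks repository:

    #include <Eigen/Geometry>

    // Illustrative sketch, not the benchmarked implementation.
    // Tangent-space difference between two unit quaternions:
    //   f(q0, q1) = log(conj(q0) * q1)
    // The result is a rotation vector (angle * axis) in R^3.
    Eigen::Vector3d quat_local_coords(const Eigen::Quaterniond& q0,
                                      const Eigen::Quaterniond& q1) {
      // Relative rotation taking q0 to q1.
      const Eigen::Quaterniond q01 = q0.conjugate() * q1;
      // The angle-axis conversion implements the SO(3) log map.
      const Eigen::AngleAxisd aa{q01};
      return aa.angle() * aa.axis();
    }

    // Tangent-space interpolation:
    //   f(q0, q1, alpha) = q0 * exp(log(conj(q0) * q1) * alpha)
    Eigen::Quaterniond quat_interpolate(const Eigen::Quaterniond& q0,
                                        const Eigen::Quaterniond& q1,
                                        const double alpha) {
      const Eigen::Vector3d w = quat_local_coords(q0, q1) * alpha;
      const double angle = w.norm();
      // SO(3) exp map via angle-axis, guarding the zero-angle case.
      const Eigen::Quaterniond delta =
          angle > 0.0
              ? Eigen::Quaterniond{Eigen::AngleAxisd{angle, w / angle}}
              : Eigen::Quaterniond::Identity();
      return q0 * delta;
    }
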
These functions are plausible sub-expressions of a visual-inertial odometry (VIO) or visual-inertial navigation (VINS) system, which is why they were selected for study. For each function, we also compute the tangent-space Jacobians with respect to the input values.

Jacobian Computation

When computing Jacobians on manifolds (for example the rotation group \(SO\left(3\right)\)), there are two plausible approaches that are well suited to symbolic code generation:

  1. We can compute the Jacobians with respect to the group variables (for instance, the four variables \(\left[w, x, y, z\right]\) that make up a quaternion) and then chain them with the Jacobian of the retraction operation (see section II.D of [1]) evaluated around zero. We refer to this as the chain-rule method; it is detailed in section B.1 of the SymForce paper, and written out explicitly after this list.

  2. Alternatively, we can first replace the retraction operation with a first-order Taylor series in the vector variable \(\delta \mathbf{x}\), substitute the series into the function, compute Jacobians with respect to \(\delta \mathbf{x}\), and then evaluate the result at \(\delta \mathbf{x} = 0\). This method, detailed in section B.2 of the SymForce paper, is what we refer to as the first-order method. It can produce fewer operations in certain instances.

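Written out, the chain-rule method evaluates the product below, where \(\text{retract}\left(\mathbf{q}, \delta \mathbf{x}\right)\) denotes the retraction applied to quaternion \(\mathbf{q}\) (the notation here is ours):

    \[\frac{\partial \mathbf{f}}{\partial \delta \mathbf{x}} \biggr\rvert_{\delta \mathbf{x} = \mathbf{0}} = \frac{\partial \mathbf{f}\left(\mathbf{q}\right)}{\partial \mathbf{q}} \cdot \frac{\partial \, \text{retract}\left(\mathbf{q}, \delta \mathbf{x}\right)}{\partial \delta \mathbf{x}} \biggr\rvert_{\delta \mathbf{x} = \mathbf{0}}\]

The first factor is the Jacobian with respect to the four quaternion elements, and the second is the \(4 \times 3\) Jacobian of the retraction evaluated at \(\delta \mathbf{x} = \mathbf{0}\).
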
SymForce provides an explicit interface for manifolds, and defaults to taking derivatives via the first-order method. To perform an apples-to-apples comparison, we apply the chain-rule method to both frameworks; nothing prevents the user from applying the first-order approximation with wrenfold, if they so desire. The two code-generated implementations we evaluate are:

  • XXX_Wrenfold: wrenfold symbolic expressions with the chain-rule method.

  • XXX_SymforceChain: SymForce symbolic expressions with the chain-rule method. These are directly comparable to the _Wrenfold implementations.

We also compare against two additional implementations:

  • XXX_Handwritten: A handwritten implementation. Eigen is used to provide rotation and linear-algebra operations. Jacobians are computed “GTSAM style” by manually chaining together derivatives for each step of the function (a short sketch of this pattern follows the list):

    \[\frac{\partial \mathbf{f}\left(\mathbf{g}\left(\mathbf{x}\right)\right)} {\partial \mathbf{x}} = \frac{\partial \mathbf{f}\left(\mathbf{u}\right)} {\partial \mathbf{u}} \biggr\rvert_{\mathbf{u} = \mathbf{g}\left(\mathbf{x}\right)} \frac{\partial \mathbf{g}\left(\mathbf{x}\right)} {\partial \mathbf{x}}\]
  • XXX_Ceres: A handwritten implementation that employs Ceres auto-diff to compute Jacobians.

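To illustrate the “GTSAM style” chaining pattern, here is a short sketch with hypothetical placeholder steps g and f (these particular functions are ours, chosen only to show the pattern). Each step optionally fills the Jacobian with respect to its input, and the caller assembles the composed Jacobian via the chain rule:

    #include <Eigen/Dense>

    // Hypothetical inner step: u = g(x) = 2x + e_z, with optional Jacobian du/dx.
    Eigen::Vector3d g(const Eigen::Vector3d& x, Eigen::Matrix3d* J_x) {
      if (J_x != nullptr) {
        *J_x = 2.0 * Eigen::Matrix3d::Identity();
      }
      return 2.0 * x + Eigen::Vector3d::UnitZ();
    }

    // Hypothetical outer step: y = f(u) = element-wise square of u,
    // with optional Jacobian dy/du = 2 * diag(u).
    Eigen::Vector3d f(const Eigen::Vector3d& u, Eigen::Matrix3d* J_u) {
      if (J_u != nullptr) {
        *J_u = (2.0 * u).asDiagonal();
      }
      return u.cwiseProduct(u);
    }

    // y = f(g(x)), with the output Jacobian assembled via the chain rule:
    //   dy/dx = (df/du) * (du/dx)
    Eigen::Vector3d f_of_g(const Eigen::Vector3d& x, Eigen::Matrix3d* J_x) {
      Eigen::Matrix3d J_g, J_f;
      const Eigen::Vector3d u = g(x, J_x ? &J_g : nullptr);
      const Eigen::Vector3d y = f(u, J_x ? &J_f : nullptr);
      if (J_x != nullptr) {
        *J_x = J_f * J_g;
      }
      return y;
    }
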
In all likelihood, a reasonably diligent author can produce a handwritten implementation that surpasses the code-generated equivalent. We argue that the strength of code generation is the ability to produce a competitive implementation in short order, allowing for quicker development and evaluation in the context of a production system.

Results

The following results were last updated with wrenfold v0.0.5. We collect results for gcc 12.3.0-3 and clang 16.0.6, both using optimization level -O3 and targeting the x86_64 architecture. The benchmark code can be found in the wrenfold-benchmarks repository.

We report times as multiples of the wrenfold-generated implementation compiled with the same compiler (that is, \(\text{multiple} = t_{\text{implementation}} / t_{\text{wrenfold}}\)). A multiple > 1.0 indicates that the implementation under comparison is slower than wrenfold, while a multiple < 1.0 indicates it is faster. Plots are available below; a few observations follow:

  • Code-generated functions are roughly comparable to the handwritten implementations. For example, the handwritten forms of QuatInterpolation and ImuIntegration are ~5-15% faster than wrenfold under gcc [2], and ~5-10% slower under clang.

  • When comparing wrenfold to SymforceChain, we find that:

    • For the three most complicated functions (ImuIntegration, QuatInterpolation, and RollingShutterCamera), wrenfold implementations are faster than SymForce.

    • For the QuatInterpolation test, the SymForce implementations require at least twice the runtime of the wrenfold versions under both gcc and clang.

  • Auto-differentiated Ceres implementations are always slower than their code-generated equivalents, sometimes by multiples as high as 7x or 8x.

From our (evidently biased) perspective, the primary takeaway with regard to performance is that code-generated methods are a viable substitute for hand-rolled implementations: they can be used to rapidly prototype mathematical functions while incurring a relatively small performance trade-off.

[Plots: benchmark runtimes for each function, reported as multiples of the wrenfold implementation under gcc and clang]
Footnotes