# Runtime Performance

In the following section, we assess the performance of wrenfold-generated functions.
## Caveats
As with any benchmarking results, there are some important caveats to consider:

- Choice of compiler, compiler version, and CPU architecture may meaningfully alter results.
- Compiler flags (for instance, enabling `-march=native`) can optimize or pessimize a given implementation, sometimes in unexpected ways.
- Performance is a moving target, and results will evolve over time.
For the most accurate outcome, you should test your function in-context with a compiler and CPU architecture that reflect your production environment.
## Overview
The results reported here were computed using the code in the wrenfold-benchmarks repository. We consider the performance of four functions of increasing complexity:
- `QuatLocalCoords`: Computes the tangent-space difference between two orientations parameterized as quaternions. This is sometimes referred to as the local-coordinates operation:

  \[f(\mathbf{q}_0, \mathbf{q}_1) = \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right)\]

- `ImuIntegration`: Performs a single step of IMU preintegration.
- `QuatInterpolation`: Given an interpolation fraction \(\alpha\), computes the tangent-space interpolation between two quaternions:

  \[f(\mathbf{q}_0, \mathbf{q}_1, \alpha) = \mathbf{q}_0 \cdot \text{exp}\left( \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right) \cdot \alpha\right)\]

- `RollingShutterCamera`: Projects a Euclidean point into a moving rolling-shutter camera. The camera uses a first-order (constant-velocity) motion model and the OpenCV intrinsic model with radial and tangential distortion coefficients.
These functions are plausible sub-expressions of a visual-inertial odometry (VIO) or visual-inertial navigation (VINS) system, which is why they were selected for study. For each function, we also compute the tangent-space Jacobians with respect to the input values.
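As a concrete reference for the first and third operations, the local-coordinates and interpolation formulas above can be sketched numerically. This is an illustrative NumPy sketch of the underlying math (with our own helper names), not the generated benchmark code:

```python
import numpy as np

def quat_mul(a, b):
    # Hamilton product of quaternions stored as [w, x, y, z].
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_log(q):
    # Tangent-space (rotation-vector) logarithm of a unit quaternion.
    w, v = q[0], q[1:]
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return 2.0 * v / w  # small-angle approximation
    angle = 2.0 * np.arctan2(norm_v, w)
    return angle * v / norm_v

def quat_exp(w):
    # Inverse of quat_log: rotation vector -> unit quaternion.
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.array([1.0, *(0.5 * w)])  # small-angle approximation
    axis = w / theta
    return np.array([np.cos(theta / 2), *(np.sin(theta / 2) * axis)])

def local_coords(q0, q1):
    # f(q0, q1) = log(conj(q0) * q1)
    return quat_log(quat_mul(quat_conj(q0), q1))

def interpolate(q0, q1, alpha):
    # f(q0, q1, alpha) = q0 * exp(log(conj(q0) * q1) * alpha)
    return quat_mul(q0, quat_exp(local_coords(q0, q1) * alpha))
```

For example, with `q0` the identity and `q1` a 90-degree rotation about the x-axis, `local_coords` recovers the rotation vector of length \(\pi / 2\), and `interpolate` with \(\alpha = 0.5\) yields the 45-degree rotation.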
## Jacobian Computation
When computing Jacobians on manifolds (for example, the rotation group \(SO\left(3\right)\)), there are two plausible approaches that are well suited to symbolic code generation:

1. We can compute the Jacobians with respect to the group variables (for instance, the four variables \(\left[w, x, y, z\right]\) that make up a quaternion) and then chain them with the Jacobian of the retraction operation (see section II.D of [1]) evaluated around zero. We refer to this as the chain-rule method, which is detailed in section B.1 of the SymForce paper.
2. Alternatively, we can first replace the retract operation with a first-order Taylor series in the vector variable \(\delta \mathbf{x}\), substitute the series into the function, compute Jacobians with respect to \(\delta \mathbf{x}\), and then evaluate the result around \(\delta \mathbf{x} = 0\). This method, which we refer to as the first-order method, is detailed in section B.2 of the SymForce paper. It can produce fewer operations in certain instances.
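The equivalence of the two methods can be illustrated symbolically on a small example. The sketch below uses plain SymPy (not either framework's actual API) and the 2-D rotation group \(SO\left(2\right)\) to keep the expressions short; all symbol and variable names here are our own:

```python
import sympy as sp

c, s, dx, px, py = sp.symbols("c s dx p_x p_y", real=True)

R = sp.Matrix([[c, -s], [s, c]])  # SO(2) element via its group variables (c, s)
p = sp.Matrix([px, py])
G = sp.Matrix([[0, -1], [1, 0]])  # so(2) generator

f = R * p  # the function on the manifold: rotate a point

# Chain-rule method: d f / d (c, s), chained with the derivative of the
# retraction R(delta) = R * exp(delta * G) at delta = 0, which to first
# order gives c(delta) = c - s*delta, s(delta) = s + c*delta.
J_group = f.jacobian(sp.Matrix([c, s]))
J_retract = sp.Matrix([-s, c])
J_chain = J_group * J_retract

# First-order method: substitute the series R * (I + delta * G) into the
# function, differentiate with respect to delta, then evaluate at delta = 0.
f_series = (R * (sp.eye(2) + dx * G)) * p
J_first_order = f_series.diff(dx).subs(dx, 0)

# Both methods agree; each equals R * G * p for this function.
assert sp.simplify(J_chain - J_first_order) == sp.zeros(2, 1)
```

The two methods always produce the same Jacobian; they differ only in the intermediate expressions generated along the way, which is why one can yield fewer operations than the other.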
SymForce provides an explicit interface for manifolds and defaults to taking derivatives via the first-order method. In order to perform an apples-to-apples comparison, we apply the chain-rule method to both frameworks; nothing prevents a user from applying the first-order approximation with wrenfold, if they so desire. The two code-generated implementations we evaluate are:
- `XXX_Wrenfold`: wrenfold symbolic expressions with the chain-rule method.
- `XXX_SymforceChain`: SymForce symbolic expressions with the chain-rule method. These are directly comparable to the `XXX_Wrenfold` implementations.
We also compare against two additional implementations:
- `XXX_Handwritten`: A handwritten implementation. Eigen is used to provide rotation and linear-algebra operations. Jacobians are computed "GTSAM style" by manually chaining together derivatives for each step of the function:

  \[\frac{\partial \mathbf{f}\left(\mathbf{g}\left(\mathbf{x}\right)\right)} {\partial \mathbf{x}} = \frac{\partial \mathbf{f}\left(\mathbf{u}\right)} {\partial \mathbf{u}} \biggr\rvert_{\mathbf{u} = \mathbf{g}\left(\mathbf{x}\right)} \frac{\partial \mathbf{g}\left(\mathbf{x}\right)} {\partial \mathbf{x}}\]

- `XXX_Ceres`: A handwritten implementation that employs Ceres auto-diff to compute Jacobians.
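The "GTSAM style" chaining pattern can be sketched as follows. This is an illustrative NumPy example with made-up toy functions (not the benchmark code itself): each step returns its value together with its local Jacobian, and the caller multiplies the Jacobians in order:

```python
import numpy as np

def g(x):
    # Inner step: a toy nonlinearity, returned with its local Jacobian.
    u = np.array([np.sin(x[0]), x[0] * x[1]])
    J_g = np.array([[np.cos(x[0]), 0.0],
                    [x[1], x[0]]])
    return u, J_g

def f(u):
    # Outer step: squared norm, returned with its Jacobian (a row vector).
    y = u @ u
    J_f = 2.0 * u[None, :]
    return y, J_f

def f_of_g(x):
    # "GTSAM style": chain the local Jacobians by hand,
    # d f(g(x)) / d x = J_f(u) @ J_g(x) with u = g(x).
    u, J_g = g(x)
    y, J_f = f(u)
    return y, J_f @ J_g

x = np.array([0.3, -0.7])
y, J = f_of_g(x)

# Sanity-check the chained Jacobian against central finite differences.
eps = 1e-6
J_fd = np.array([
    (f_of_g(x + eps * e)[0] - f_of_g(x - eps * e)[0]) / (2 * eps)
    for e in np.eye(2)
])
assert np.allclose(J, J_fd, atol=1e-6)
```

The handwritten benchmark implementations follow this pattern with Eigen types; the auto-diff alternative (as in the Ceres variants) removes the manual Jacobian bookkeeping at some runtime cost.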
In all likelihood, a reasonably diligent author can produce a handwritten implementation that surpasses the code-generated equivalent. We argue that the strength of code generation is the ability to produce a competitive implementation in short order, allowing for quicker development and evaluation in the context of a production system.
## Results
The following results were last updated with wrenfold `v0.0.5`. We collect results for `gcc 12.3.0-3` and `clang 16.0.6`, both using optimization level `-O3` on the x86_64 architecture. The benchmark code can be found in the wrenfold-benchmarks repository.

We report the resulting times as multiples of the wrenfold-generated implementation (for the same compiler). A multiple > 1.0 indicates that the implementation under comparison is slower than wrenfold, while a multiple < 1.0 indicates that it is faster. Plots are available below; a few observations follow:
- Code-generated functions are roughly comparable to the handwritten implementations. For example, `QuatInterpolation` and `ImuIntegration` are ~5-15% faster in handwritten form under gcc [2], and ~5-10% slower than wrenfold under clang.
- When comparing wrenfold to `SymforceChain`, we find that:
  - For the three most complicated functions (`ImuIntegration`, `QuatInterpolation`, and `RollingShutterCamera`), the wrenfold implementations are faster than SymForce.
  - For the `QuatInterpolation` test, the SymForce implementations require at least twice the time under both gcc and clang.
- Auto-differentiated Ceres implementations are always slower than their code-generated equivalents, sometimes by multiples as high as 7x or 8x.
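As an aside, the reported multiples are simple ratios against the wrenfold baseline for the same compiler. The timing numbers in this sketch are hypothetical, purely to illustrate the arithmetic:

```python
# Hypothetical timings (ns per call) -- NOT measured benchmark results.
timings_ns = {
    "QuatInterpolation_Wrenfold": 25.0,       # baseline (multiple = 1.0)
    "QuatInterpolation_SymforceChain": 55.0,  # hypothetical
    "QuatInterpolation_Handwritten": 23.0,    # hypothetical
}
baseline = timings_ns["QuatInterpolation_Wrenfold"]
multiples = {name: t / baseline for name, t in timings_ns.items()}
# A multiple > 1.0 is slower than wrenfold; < 1.0 is faster.
```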
From our (evidently biased) perspective, the primary takeaway with regard to performance is that code-generated methods are a comparable substitute for hand-rolled implementations. They can be used to rapidly prototype mathematical functions while incurring a relatively small performance trade-off.
## Footnotes