Runtime Performance
===================

In this section, we assess the performance of wrenfold-generated functions.

Caveats
-------

As with any benchmarking results, there are some important caveats to consider:

* Choice of compiler, compiler version, and CPU architecture may meaningfully alter results.
* Compiler flags (for instance, enabling ``-march=native``) can optimize *or* pessimize a given
  implementation, sometimes in unexpected ways.
* Performance is a moving target, and results will evolve over time.

For the most accurate outcome, you should test your function in context, with a compiler and CPU
architecture that reflect your production environment.

Overview
--------

The results reported here were computed using the code in the
`wrenfold-benchmarks `__ repository. We consider
the performance of four functions of increasing complexity:

* ``QuatLocalCoords``: Computing the `tangent-space `__
  difference between two orientations parameterized as quaternions. This is sometimes referred to
  as the *local coordinates* operation (a handwritten sketch of this operation appears after this
  list):

  .. math::

      f(\mathbf{q}_0, \mathbf{q}_1) = \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right)

* ``ImuIntegration``: Performing a single step of
  `IMU preintegration `__.

* ``QuatInterpolation``: Given an interpolation fraction :math:`\alpha`, compute the tangent-space
  interpolation between two quaternions:

  .. math::

      f(\mathbf{q}_0, \mathbf{q}_1, \alpha) = \mathbf{q}_0 \cdot \text{exp}\left(
      \text{log}\left(\bar{\mathbf{q}}_0 \cdot \mathbf{q}_1\right) \cdot \alpha\right)

* ``RollingShutterCamera``: This function projects a Euclidean point into a moving rolling-shutter
  camera. The camera uses a first-order (constant velocity) motion model, and the
  `OpenCV `__ intrinsic model with radial
  and tangential distortion coefficients.
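
For reference, a minimal handwritten sketch of the ``QuatLocalCoords`` operation (value only, with
Jacobians omitted) might look like the following. This is an illustrative example built on Eigen
rather than the benchmarked implementation, and the small-angle threshold is our own assumption:

.. code-block:: cpp

    #include <cmath>

    #include <Eigen/Geometry>

    // Tangent-space difference between two unit quaternions:
    //   f(q0, q1) = log(conjugate(q0) * q1)
    // The result is a rotation vector (axis * angle) in R^3.
    Eigen::Vector3d QuatLocalCoords(const Eigen::Quaterniond& q0,
                                    const Eigen::Quaterniond& q1) {
      // Relative rotation taking q0 into q1.
      Eigen::Quaterniond dq = q0.conjugate() * q1;
      // Negate to keep the scalar part non-negative, selecting the shortest rotation.
      if (dq.w() < 0.0) {
        dq.coeffs() *= -1.0;
      }
      const double vec_norm = dq.vec().norm();
      if (vec_norm < 1e-12) {  // Threshold chosen for illustration.
        // Small-angle branch: first-order approximation of the quaternion log map.
        return 2.0 * dq.vec();
      }
      const double angle = 2.0 * std::atan2(vec_norm, dq.w());
      return (angle / vec_norm) * dq.vec();
    }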

These functions are plausible subexpressions of a `visual-inertial odometry (VIO) or
visual-inertial navigation system (VINS) `__, which is why they
were selected for study. For each function we also compute the tangent-space Jacobians with respect
to the input values.

Jacobian Computation
--------------------

When computing Jacobians on manifolds (for example the rotation group :math:`SO\left(3\right)`),
there are two plausible approaches that are well suited to symbolic code generation (the first is
written out concretely after this list):

#. We can compute the Jacobians with respect to the group variables (for instance, the four
   variables :math:`\left[w, x, y, z\right]` that make up a quaternion) and then chain
   them with the Jacobian of the *retraction* operation (see section ``II.D`` of [#]_) evaluated
   around zero. We refer to this as the *chain-rule* method, which is detailed in section ``B.1``
   of the `SymForce paper `__.
#. Alternatively, we can first replace the retraction operation with a first-order Taylor series in
   the vector variable :math:`\delta \mathbf{x}`, substitute the series into the function, and
   then evaluate the result around :math:`\delta \mathbf{x} = 0` after computing Jacobians with
   respect to :math:`\delta \mathbf{x}`. This method is detailed in section ``B.2`` of the
   SymForce paper. We refer to this as the *first-order* method. This approach can produce fewer
   operations in certain instances.
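
As an illustration of the chain-rule method, consider a function :math:`\mathbf{f}` of a
quaternion :math:`\mathbf{q}` with retraction :math:`\mathbf{q} \cdot
\text{exp}\left(\delta \mathbf{x}\right)`. The tangent-space Jacobian then factors as (our
notation, following the conventions above):

.. math::

    \frac{\partial \mathbf{f}}{\partial \delta \mathbf{x}}
    \biggr\rvert_{\delta \mathbf{x} = \mathbf{0}} =
    \frac{\partial \mathbf{f}\left(\mathbf{q}\right)}{\partial \mathbf{q}}
    \frac{\partial \left(\mathbf{q} \cdot \text{exp}\left(\delta \mathbf{x}\right)\right)}
    {\partial \delta \mathbf{x}} \biggr\rvert_{\delta \mathbf{x} = \mathbf{0}}

Here the first factor is taken with respect to the four quaternion elements, and the second is the
Jacobian of the retraction evaluated around zero.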

SymForce provides an `explicit interface `__
for manifolds, and defaults to taking derivatives via the first-order method. In order to perform
an apples-to-apples comparison, we apply the *chain-rule* method to both frameworks. There is
nothing to prevent the user from applying the first-order method with wrenfold, if they so desire.
The two code-generated implementations we evaluate are:

* ``XXX_Wrenfold``: wrenfold symbolic expressions with the chain-rule method.
* ``XXX_SymforceChain``: SymForce symbolic expressions with the chain-rule method. These are
  directly comparable to the ``_Wrenfold`` implementations.

We also compare against two additional implementations:

* ``XXX_Handwritten``: A handwritten implementation. `Eigen `__ is
  used to provide rotation and linear-algebra operations. Jacobians are computed "GTSAM style" by
  manually chaining together derivatives for each step of the function (see the sketch after this
  list):

  .. math::

      \frac{\partial \mathbf{f}\left(\mathbf{g}\left(\mathbf{x}\right)\right)}
      {\partial \mathbf{x}} =
      \frac{\partial \mathbf{f}\left(\mathbf{u}\right)}
      {\partial \mathbf{u}} \biggr\rvert_{\mathbf{u} = \mathbf{g}\left(\mathbf{x}\right)}
      \frac{\partial \mathbf{g}\left(\mathbf{x}\right)}
      {\partial \mathbf{x}}

* ``XXX_Ceres``: A handwritten implementation that employs
  `Ceres autodiff `__ to compute Jacobians.
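
To make the "GTSAM style" concrete, here is a minimal sketch of manually chained Jacobians. The
composed function (an affine map followed by a squared norm) is our own toy example, not one of
the benchmarked functions; each step optionally writes the Jacobian of its output with respect to
its input, and the caller multiplies them together:

.. code-block:: cpp

    #include <Eigen/Dense>

    // Toy example: g(x) = A * x + b. Writes dg/dx into `J_x` when requested.
    Eigen::Vector3d Affine(const Eigen::Matrix3d& A, const Eigen::Vector3d& b,
                           const Eigen::Vector3d& x, Eigen::Matrix3d* J_x = nullptr) {
      if (J_x != nullptr) {
        *J_x = A;  // dg/dx = A
      }
      return A * x + b;
    }

    // f(u) = |u|^2. Writes df/du (a row vector) into `J_u` when requested.
    double SquaredNorm(const Eigen::Vector3d& u, Eigen::RowVector3d* J_u = nullptr) {
      if (J_u != nullptr) {
        *J_u = 2.0 * u.transpose();  // df/du = 2 * u^T
      }
      return u.squaredNorm();
    }

    // f(g(x)): chain the per-step Jacobians manually, as in the equation above.
    double Composed(const Eigen::Matrix3d& A, const Eigen::Vector3d& b,
                    const Eigen::Vector3d& x, Eigen::RowVector3d* J_x = nullptr) {
      Eigen::Matrix3d J_g;
      Eigen::RowVector3d J_f;
      const Eigen::Vector3d u = Affine(A, b, x, J_x ? &J_g : nullptr);
      const double f = SquaredNorm(u, J_x ? &J_f : nullptr);
      if (J_x != nullptr) {
        *J_x = J_f * J_g;  // df/dx = (df/du)|_{u=g(x)} * dg/dx
      }
      return f;
    }

The Ceres approach instead templates the function on its scalar type and lets dual numbers
(``ceres::Jet``) propagate derivatives automatically. The following sketch applies jets to the
same toy function; it illustrates the general technique rather than the benchmark code itself:

.. code-block:: cpp

    #include <ceres/jet.h>

    #include <Eigen/Dense>

    // The same toy function, templated on the scalar type so that ceres::Jet
    // can flow through it.
    template <typename T>
    T ComposedT(const Eigen::Matrix3d& A, const Eigen::Vector3d& b,
                const Eigen::Matrix<T, 3, 1>& x) {
      const Eigen::Matrix<T, 3, 1> u = A.cast<T>() * x + b.cast<T>();
      return u[0] * u[0] + u[1] * u[1] + u[2] * u[2];
    }

    // Evaluate f and df/dx by seeding one jet per input dimension.
    double ComposedWithJets(const Eigen::Matrix3d& A, const Eigen::Vector3d& b,
                            const Eigen::Vector3d& x, Eigen::RowVector3d* J_x) {
      using Jet3 = ceres::Jet<double, 3>;
      Eigen::Matrix<Jet3, 3, 1> x_jets;
      for (int i = 0; i < 3; ++i) {
        x_jets[i] = Jet3(x[i], i);  // Seed the i-th partial derivative.
      }
      const Jet3 f = ComposedT(A, b, x_jets);
      if (J_x != nullptr) {
        *J_x = f.v.transpose();  // The infinitesimal part holds df/dx.
      }
      return f.a;
    }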

In all likelihood, a reasonably diligent author can produce a handwritten implementation that
surpasses the code-generated equivalent. We argue that the strength of code generation is the
ability to produce a *competitive* implementation in short order, allowing for quicker development
and evaluation in the context of a production system.

Results
-------

The following results were last updated with wrenfold ``v0.0.5``. We collect results for
``gcc 12.3.0-3`` and ``clang 16.0.6``, both using optimization level ``-O3`` and the ``x86_64``
architecture. The benchmark code can be found in the
`wrenfold-benchmarks `_ repository.

We report the resulting times as multiples of the wrenfold-generated implementation (for the same
compiler). A multiple ``> 1.0`` indicates that the implementation under comparison is slower than
wrenfold, while a multiple ``< 1.0`` indicates that it is faster. Plots are available below; a few
observations follow:

* Code-generated functions are *roughly* comparable to the handwritten implementations. For
  example, ``QuatInterpolation`` and ``ImuIntegration`` are ~5-15% faster in handwritten form under
  gcc [#]_, and ~5-10% slower than wrenfold under clang.
* When comparing wrenfold to ``SymforceChain``, we find that:

  * For the three most complicated functions (``ImuIntegration``, ``QuatInterpolation``,
    and ``RollingShutterCamera``), wrenfold implementations are faster than SymForce.
  * For the ``QuatInterpolation`` test, the SymForce implementations require at least twice the
    time under both gcc and clang.

* Auto-differentiated Ceres implementations are always slower than their code-generated
  equivalents, sometimes by multiples as high as 7x or 8x.

From our (evidently biased) perspective, **the primary takeaway** (with regard to performance) is
that code-generated methods are a comparable substitute for hand-rolled implementations. They can
be used to rapidly prototype mathematical functions while incurring a relatively small performance
trade-off.

.. raw:: html
    :file: _static/benchmark_plots/QuatLocalCoords-gcc.html

.. raw:: html
    :file: _static/benchmark_plots/QuatLocalCoords-clang.html

.. raw:: html
    :file: _static/benchmark_plots/ImuIntegration-gcc.html

.. raw:: html
    :file: _static/benchmark_plots/ImuIntegration-clang.html

.. raw:: html
    :file: _static/benchmark_plots/QuatInterpolation-gcc.html

.. raw:: html
    :file: _static/benchmark_plots/QuatInterpolation-clang.html

.. raw:: html
    :file: _static/benchmark_plots/RollingShutterCamera-gcc.html

.. raw:: html
    :file: _static/benchmark_plots/RollingShutterCamera-clang.html

.. rubric:: Footnotes

.. [#] `A micro Lie theory for state estimation in robotics `__
.. [#] A comparison of the generated assembly reveals that Eigen is sometimes able to more
   effectively leverage SIMD operations, which may explain this gap. This suggests an avenue for
   possible future improvements to the code-generation framework. See this
   `related issue `__.