Compilation Augmentation Enables High-Performance Batch Differentiation
2023-02-23, 13:30–14:00 (US/Mountain), ECNT 312

Automatic differentiation is the cornerstone of many modern computing applications, such as rendering, optimization, machine learning, and wider scientific computing. In most application domains, one does not need just a single derivative, but rather derivatives of many values or batches, computed repeatedly. This is especially true for techniques such as automatic differentiation [8], differentiable optimization layers [2, 3], and differentiable sorting [7]. Consider, for example, integrating a differential-equation operator into a larger machine learning architecture, abstracted away as a function evaluation f. If f depends on a call c that is slow to evaluate and is directly involved in computing the derivative df of f, then evaluating the derivative of f with respect to every input becomes an expensive operation, since c has to be called once for each evaluation of the derivative. Evaluating the derivative of f with respect to its inputs in this way is inefficient and would make such a differentiable approach unviable. Rather than computing each derivative individually, this work explores how we can leverage the compiler for efficient batched differentiation.
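To make the cost argument concrete, the following minimal C sketch (hand-written tangents, not Enzyme-generated code; the names c, f, and df_batch are purely illustrative) contrasts per-input differentiation, which re-evaluates the slow call c for every direction, with a batched variant that shares c's result across all directions:

```c
#include <math.h>
#include <stdio.h>

/* Stand-in for a slow call c whose result enters the derivative of f,
 * e.g. an ODE solve embedded in a larger model (illustration only). */
double c(double x)  { return sin(x); }   /* imagine this is expensive */
double dc(double x) { return cos(x); }   /* its derivative, also expensive */

/* f(x) = c(x) * x, so df = (dc(x) * x + c(x)) * dx. */
double f(double x) { return c(x) * x; }

/* Naive per-input differentiation: n directional derivatives trigger
 * n evaluations of the expensive c and dc. */
double df_single(double x, double dx) {
    return (dc(x) * x + c(x)) * dx;
}

/* Batched differentiation: the expensive values are computed once and
 * reused for every tangent direction in the batch. */
void df_batch(double x, const double *dx, double *out, int n) {
    const double cx = c(x), dcx = dc(x);  /* shared across the batch */
    for (int i = 0; i < n; ++i)
        out[i] = (dcx * x + cx) * dx[i];
}

int main(void) {
    double dx[3] = {1.0, 0.5, -2.0}, out[3];
    df_batch(1.3, dx, out, 3);
    for (int i = 0; i < 3; ++i)
        printf("df[%d] = %f\n", i, out[i]);
    return 0;
}
```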
With the efficacy of modern machine learning approaches often depending on the performance of the associated training system, and with growing vector register sizes and GPUs, the efficient exploitation of vectorizable computation is becoming ever more important. This permeates many modern approaches, such as PyTorch [10] with its recent compiler-centric push towards TorchDynamo [4] and TorchInductor [5], TensorFlow [1], Dex [11], Julia [6], and, directly inside the LLVM compiler, the Enzyme compiler plugin [9] for automatic differentiation.
Unlike existing approaches to single-input and batched differentiation, which operate at the source level, we perform batched automatic differentiation within the LLVM compiler by extending Enzyme. Operating within the compiler affords Enzyme access to hardware (vector) registers, enables batching of already-optimized code, and permits memory layout transformations. Batching derivative computations inside Enzyme allows us to reuse evaluations of the primal within a batch. In reverse-mode automatic differentiation, primal values can become unavailable in the reverse pass because they are overwritten in memory; within a batch, we need to cache or recompute those values only once. Moreover, operating on LLVM extends these performance benefits to any LLVM frontend language (e.g., C/C++, Julia, Rust, Fortran, Swift) and to all hardware architectures (e.g., CPU, GPU) targeted by LLVM.
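As a rough illustration of this reuse (a hand-written sketch, not the code Enzyme actually emits), consider a primal that overwrites its input in place; a batched reverse pass needs only a single copy of the cached primal values, independent of the batch width:

```c
#include <stdlib.h>

/* Primal: overwrites x[i] in place, so the original values are gone
 * after the forward sweep. */
void f(double *x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * x[i];               /* x_new = x_old^2 */
}

/* Hand-written batched reverse pass: dx[w][i] holds the adjoint of the
 * i-th output for batch lane w on entry, and the adjoint of the i-th
 * input on exit. The overwritten primal values are cached ONCE and
 * shared by every lane, rather than once per derivative. */
void df_batch(double *x, double **dx, int n, int width) {
    /* forward sweep: cache values needed by the reverse sweep */
    double *cache = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) {
        cache[i] = x[i];                  /* one cache for all lanes */
        x[i] = x[i] * x[i];
    }
    /* reverse sweep: every batch lane reuses the same cached primal */
    for (int i = n - 1; i >= 0; --i)
        for (int w = 0; w < width; ++w)
            dx[w][i] *= 2.0 * cache[i];   /* d(x^2)/dx = 2 * x_old */
    free(cache);
}
```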
We compare the performance of automated batching on a variety of benchmarks and memory layout choices, storing batched data either as a struct-of-arrays or as an array-of-structs. We compare against both existing non-batched Enzyme derivatives and state-of-the-art batched differentiation tools such as Tapenade.
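For concreteness, the two layout choices for a batch of shadow values might look as follows in C (batch width, sizes, and names are illustrative):

```c
#define WIDTH 4      /* batch width (illustrative) */
#define N     1024   /* number of shadowed elements (illustrative) */

/* Array-of-structs (AoS): the WIDTH lanes of one element are adjacent. */
typedef struct { double d[WIDTH]; } ShadowAoS;
ShadowAoS shadow_aos[N];

/* Struct-of-arrays (SoA): each batch lane keeps its own contiguous array. */
typedef struct { double lane[WIDTH][N]; } ShadowSoA;
ShadowSoA shadow_soa;

/* Accessing lane w of element i:
 *   AoS: shadow_aos[i].d[w]     -- lanes of one element are contiguous
 *   SoA: shadow_soa.lane[w][i]  -- elements of one lane are contiguous */
```

Array-of-structs keeps the lanes of one element adjacent, mapping naturally onto per-element vector registers, while struct-of-arrays keeps each lane contiguous, matching the layout of independent non-batched shadows.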