Differentiating functionals
Finite differencing
The definition of the derivative of $\widehat{J}$ with respect to the $i$th component of $m$ is

$$\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m} \cdot e_i = \lim_{h \rightarrow 0} \frac{\widehat{J}(m + h e_i) - \widehat{J}(m)}{h},$$

where $e_i$ is the vector with 0 in all entries except for 1 in
the $i$th entry. Each component of the gradient
vector $\mathrm{d}\widehat{J}/\mathrm{d}m$ is the derivative of
the functional with respect to the corresponding
component of $m$. A simple idea for approximating the derivative
is to compute each component of the gradient as

$$\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m} \cdot e_i \approx \frac{\widehat{J}(m + h e_i) - \widehat{J}(m)}{h}$$

for some small choice of $h$. The advantage of this approach is
that it is very straightforward: it still only requires a black-box
evaluator for $\widehat{J}$, and the approximation of the
gradient can be done entirely within the optimisation algorithm.
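As an illustration, here is a minimal sketch of this approximation in Python/NumPy, assuming a hypothetical black-box evaluator J_hat(m) that performs the PDE solve and returns the functional value:

    import numpy as np

    def fd_gradient(J_hat, m, h=1e-7):
        """Approximate the gradient of J_hat at m by one-sided finite differences."""
        m = np.asarray(m, dtype=float)
        J0 = J_hat(m)                           # baseline value: one PDE solve
        grad = np.zeros_like(m)
        for i in range(m.size):
            m_step = m.copy()
            m_step[i] += h                      # perturb the i-th parameter by h
            grad[i] = (J_hat(m_step) - J0) / h  # one extra PDE solve per parameter
        return grad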
However, this approach suffers from several serious drawbacks. One
problem is that it is not obvious how to choose an appropriate value
for $h$: choose $h$ too large, and the finite difference
will not approximate the limit value; choose $h$ too small, and
finite numerical precision (round-off error) will destroy the accuracy of the approximation. A
more serious problem, however, is that this approximation requires one
functional evaluation for each degree of freedom in the parameter
space. When each functional evaluation requires an expensive PDE
solve, this approach quickly becomes impractical, and a more
intelligent algorithm is required.
The tangent linear approach
Recall that $\widehat{J}(m)$ is the functional $J$ considered as a
pure function of $m$:

$$\widehat{J}(m) = J(u(m), m).$$

Let us apply the chain rule to $\widehat{J}(m)$:

$$\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m} = \frac{\partial J}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}m} + \frac{\partial J}{\partial m}.$$

Let us inspect each term of this relationship, and build up some
intuition about each. $\partial J/\partial m$ and $\partial J/\partial u$
are typically very straightforward
to compute: $J$ is usually a simple closed-form expression in
terms of $u$ and $m$, and so their differentiation by hand
is generally trivial. Both of these quantities are vectors, with the
dimensions of the parameter space and solution space respectively. By
contrast, the solution Jacobian $\mathrm{d}u/\mathrm{d}m$ is
rather difficult to compute. This object is a massive dense matrix,
of dimensions (solution space $\times$ parameter space), and as
such it is unlikely to fit in memory. However, let us temporarily
suppose that the number of parameters is small, and that we would like
to compute $\mathrm{d}\widehat{J}/\mathrm{d}m$ using the
relationship above.
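To make the chain rule concrete, consider a toy scalar example (an illustration, not part of the original development): take $F(u, m) = u - m^2 = 0$, so that $u(m) = m^2$, and $J(u, m) = u^2$, so that $\widehat{J}(m) = m^4$. Then

$$\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m} = \frac{\partial J}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}m} + \frac{\partial J}{\partial m} = 2u \cdot 2m + 0 = 2m^2 \cdot 2m = 4m^3,$$

which agrees with differentiating $\widehat{J}(m) = m^4$ directly.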
With the PDE $F(u, m) = 0$, we have a relationship for
$u$ as an implicit function of $m$. If we take the total
derivative of this equation with respect to $m$, we obtain a
relationship for the solution Jacobian $\mathrm{d}u/\mathrm{d}m$:

$$\frac{\mathrm{d}F}{\mathrm{d}m} = \frac{\partial F}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}m} + \frac{\partial F}{\partial m} = 0,$$

or equivalently

$$\frac{\partial F}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}m} = -\frac{\partial F}{\partial m}.$$
This last relationship is the tangent linear equation (or tangent
linear system) associated with the PDE $F(u, m) = 0$. Let us
carefully consider each term in the tangent linear system, and build
up some intuition about each.

$\mathrm{d}u/\mathrm{d}m$ is the solution Jacobian again,
with which we can compute the functional gradient
$\mathrm{d}\widehat{J}/\mathrm{d}m$. It is the prognostic
variable of this equation, the unknown quantity in the tangent linear
system.
Now consider $\partial F/\partial u$. Since $F(u, m)$
is a vector expression, its derivative with respect to
$u$ is an operator (a matrix); this operator acts on the
solution Jacobian, and therefore must be inverted or
solved. $F$ may have been nonlinear in $u$, but
$\partial F/\partial u$ is always linear. In other
words, $\partial F/\partial u$ is the linearisation
of the equation operator, linearised about a particular solution
$u$. If $F$ happened to be linear in the first
place, and so $F(u, m) \equiv A(m)u - b(m)$ for some operator
$A(m)$, then $\partial F/\partial u$ is just the
operator $A(m)$ back again.
Finally, consider the term $-\partial F/\partial m$. Like $\mathrm{d}u/\mathrm{d}m$, this is a matrix of
dimension (solution space $\times$ parameter space). This term
acts as the source term for the tangent linear system; each column of
$-\partial F/\partial m$ provides the source term for
the derivative of $u$ with respect to one scalar entry in the
parameter vector $m$.
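As a concrete sketch (an illustrative toy system, with names chosen here for exposition), suppose the discretised "PDE" is the small linear system $F(u, m) = A u - m = 0$ and the functional is $J(u, m) = \frac{1}{2}\|u\|^2$. Then $\partial F/\partial u = A$ and $\partial F/\partial m = -I$, and the tangent linear system can be solved column by column:

    import numpy as np

    # Toy discrete "PDE": F(u, m) = A u - m = 0, with J(u, m) = 0.5 * ||u||^2.
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    m = np.array([1.0, 2.0])

    u = np.linalg.solve(A, m)      # forward solve: A u = m

    dFdu = A                       # tangent linear operator (F is already linear in u)
    dFdm = -np.eye(2)

    # Tangent linear system: (dF/du) (du/dm) = -dF/dm.
    # Each column of du/dm costs one linear solve, i.e. one solve per parameter.
    du_dm = np.linalg.solve(dFdu, -dFdm)

    dJdu = u                       # dJ/du for J = 0.5 * ||u||^2
    dJdm = np.zeros_like(m)        # J has no explicit dependence on m

    grad = dJdu @ du_dm + dJdm     # chain rule: dJhat/dm = dJ/du du/dm + dJ/dm
    print(grad)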
So, when is solving the tangent linear system a sensible approach?
To answer this question, notice that we had to specify some parameter
$m$ to construct the tangent linear system, but that the
functional $J$ does not appear in it at all. In other words, for a
given parameter (input), the tangent linear solution can be used to
easily compute the gradient of any functional. This means that
solving the tangent linear system makes sense when there are a small
number of parameters (inputs), and a large number of functionals of
interest (outputs). However, this is generally not the case in
PDE-constrained optimisation. Is there a better way?
The adjoint approach
Let us rephrase the tangent linear approach to computing the
gradient. We start by fixing our choice of parameter $m$, and
then solve for the solution Jacobian
$\mathrm{d}u/\mathrm{d}m$ associated with that choice of
$m$. With this quantity in hand, we take its inner product with
a source term $\partial J/\partial u$ particular to the
functional $J$, and can then compute the gradient
$\mathrm{d}\widehat{J}/\mathrm{d}m$.
Notice that we first fixed the parameter $m$ (the “denominator”
of the gradient $\mathrm{d}\widehat{J}/\mathrm{d}m$), and
then chose which functional we wished to compute the gradient of
(the “numerator” of the gradient). Is there a way we could do
the opposite: first fix the functional $J$, and then choose
which parameter to take the gradient with respect to? The answer is
yes, and that approach is referred to as the adjoint approach.
Suppose the tangent linear system is invertible. Then we can rewrite
the solution Jacobian as

$$\frac{\mathrm{d}u}{\mathrm{d}m} = -\left(\frac{\partial F}{\partial u}\right)^{-1}\frac{\partial F}{\partial m}.$$

We usually could not compute this expression (computing the inverse of
the operator $\partial F/\partial u$ is prohibitive),
but we can still use it and reason about it. Let us substitute this
expression for the solution Jacobian into the expression for the
gradient of $\widehat{J}$:

$$\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m} = -\frac{\partial J}{\partial u}\left(\frac{\partial F}{\partial u}\right)^{-1}\frac{\partial F}{\partial m} + \frac{\partial J}{\partial m}.$$

Now let us take the adjoint (Hermitian transpose) of the above equation:

$$\left(\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m}\right)^{*} = -\left(\frac{\partial F}{\partial m}\right)^{*}\left(\frac{\partial F}{\partial u}\right)^{-*}\left(\frac{\partial J}{\partial u}\right)^{*} + \left(\frac{\partial J}{\partial m}\right)^{*}.$$

Let us take the term where the inverse adjoint Jacobian acts on a vector, and define it
to be a new variable $\lambda$, the adjoint variable:

$$\lambda = \left(\frac{\partial F}{\partial u}\right)^{-*}\left(\frac{\partial J}{\partial u}\right)^{*},$$

or equivalently

$$\left(\frac{\partial F}{\partial u}\right)^{*}\lambda = \left(\frac{\partial J}{\partial u}\right)^{*}.$$
This relationship is the adjoint equation (or adjoint system)
associated with the PDE $F(u, m) = 0$. Again, let us carefully
consider each term in the adjoint equation and build up some intuition
about each.

$\lambda$ is the adjoint variable associated with
$u$. Each component of the solution $u$ will have a
corresponding adjoint variable. For example, if $F(u, m) = 0$ is the
Navier-Stokes equations, and $u$ is the tuple of velocity and
pressure, then $\lambda$ is the tuple of adjoint velocity and
adjoint pressure. Similarly, if the problem is time-dependent, the
adjoint is also time-dependent, with each variable through time having
a corresponding adjoint value.
$\left(\partial F/\partial u\right)^{*}$ is the
adjoint of the tangent linear operator. Commonly, this is referred to
as the “adjoint operator”. By taking the transpose, we reverse the
flow of information in the equation system. For example, if a tracer
is advected downstream (and so information about upstream conditions
is advected with it), the adjoint PDE advects information in the
reverse sense, i.e. upstream. This extends to the temporal propagation
of information: if $F(u, m) = 0$ is a time-dependent PDE (and so
propagates information from earlier times to later times), the adjoint
PDE runs backwards in time (propagates information from later times
to earlier times). This property will be examined in more detail in
the next section.
$\left(\partial J/\partial u\right)^{*}$ is the source term for the adjoint
equation. It is this source term that makes an adjoint solution
specific to a particular functional. Just as one cannot speak of the
tangent linear solution without referring to a particular choice of
parameter, one cannot speak of the adjoint solution without referring
to a specific choice of functional.
As the tangent linear operator is always linear, its adjoint is also
linear, and so the adjoint equation is always linear in
$\lambda$. This property will also be examined in more detail in
the next section.
So, to compute the functional gradient
$\mathrm{d}\widehat{J}/\mathrm{d}m$, we first solve the
adjoint equation for $\lambda$ (fixing the “numerator” of the
gradient, as the adjoint is specific to the functional), and then take
its inner product with $-\partial F/\partial m$ to compute the gradient with respect to a particular parameter
$m$ (fixing the “denominator” of the gradient):

$$\left(\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m}\right)^{*} = -\left(\frac{\partial F}{\partial m}\right)^{*}\lambda + \left(\frac{\partial J}{\partial m}\right)^{*}.$$

This is precisely the dual approach to that of computing
$\mathrm{d}\widehat{J}/\mathrm{d}m$ using the tangent linear
approach, and has precisely the dual scaling: for a given functional
(output), the adjoint solution can be used to easily compute the
gradient with respect to any parameter. Therefore, solving the
adjoint system is extremely efficient when there are a small number
of functionals (outputs), and a large number of parameters
(inputs). This is precisely the case we are considering in
PDE-constrained optimisation: there is one functional (output) of
interest, but many parameters.
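Returning to the toy scalar example from earlier (again purely illustrative): for $F(u, m) = u - m^2 = 0$ and $J(u, m) = u^2$ we have $\partial F/\partial u = 1$, so the adjoint equation is simply $\lambda = \partial J/\partial u = 2u = 2m^2$, and the gradient is

$$\left(\frac{\mathrm{d}\widehat{J}}{\mathrm{d}m}\right)^{*} = -\left(\frac{\partial F}{\partial m}\right)^{*}\lambda + \left(\frac{\partial J}{\partial m}\right)^{*} = -(-2m)(2m^2) + 0 = 4m^3,$$

in agreement with the chain rule computation above.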
So, with some knowledge of the chain rule and some transposition,
we have devised an algorithm for computing the gradient
$\mathrm{d}\widehat{J}/\mathrm{d}m$ that is extremely
efficient for our case where we have many parameters and only one
functional.
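To see the duality in code, here is a sketch using the same toy discrete system as in the tangent linear sketch above (again with illustrative names): the gradient is obtained with a single adjoint solve, independent of the number of parameters.

    import numpy as np

    # Same toy discrete "PDE" as before: F(u, m) = A u - m = 0, J(u, m) = 0.5 * ||u||^2.
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    m = np.array([1.0, 2.0])

    u = np.linalg.solve(A, m)           # forward solve

    dFdu = A
    dFdm = -np.eye(2)
    dJdu = u
    dJdm = np.zeros_like(m)

    # Adjoint equation: (dF/du)^* lam = (dJ/du)^*.
    # One linear solve, regardless of how many parameters there are.
    lam = np.linalg.solve(dFdu.T, dJdu)

    # Gradient: (dJhat/dm)^* = -(dF/dm)^* lam + (dJ/dm)^*.
    grad = -dFdm.T @ lam + dJdm
    print(grad)                         # matches the tangent linear result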
Summary
A sketch of the solution approach for the PDE-constrained optimisation
problem is therefore:
1. Start with some initial guess for the parameters $m$.
2. Compute the functional $\widehat{J}(m)$ (using the forward model) and its gradient $\mathrm{d}\widehat{J}/\mathrm{d}m$ (using the adjoint model).
3. Pass these values to an optimisation algorithm.
4. This algorithm returns a new point in parameter space with a better functional value.
5. If the gradient is zero, or if the maximum number of iterations has been reached, terminate. Otherwise, go to step 2.
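A minimal sketch of this loop in Python, using SciPy's optimisation routines and the toy system from the earlier sketches (all names here are illustrative):

    import numpy as np
    from scipy.optimize import minimize

    # Toy discrete "PDE": F(u, m) = A u - m = 0, with
    # functional J(u, m) = 0.5 * ||u - u_target||^2.
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
    u_target = np.array([1.0, -1.0])

    def J_hat(m):
        """Forward model: solve A u = m and evaluate the functional."""
        u = np.linalg.solve(A, m)
        return 0.5 * np.sum((u - u_target) ** 2)

    def J_hat_gradient(m):
        """Adjoint model: solve the adjoint equation and assemble the gradient."""
        u = np.linalg.solve(A, m)
        lam = np.linalg.solve(A.T, u - u_target)  # (dF/du)^* lam = (dJ/du)^*
        return lam                                # -(dF/dm)^* lam, with dF/dm = -I

    m0 = np.zeros(2)                              # step 1: initial guess

    result = minimize(J_hat, m0,                  # steps 2-5: driven by the optimiser
                      jac=J_hat_gradient,         # gradient from the adjoint model
                      method="L-BFGS-B",
                      options={"maxiter": 50})

    m_opt = result.x                              # optimised parameters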
Of course, PDE-constrained optimisation is a much richer field than
the simple sketch above would suggest. Much work is focussed on
exploiting particular properties of the equations or the functional,
ensuring the gradient is represented with the correct Riesz representer,
imposing additional constraints on the parameter space, or
exploiting advanced forward modelling concepts such as error
estimation, goal-based adaptivity and reduced-order
modelling. Nevertheless, although complications proliferate, the above
algorithm captures the key idea of many approaches used for solving
problems of enormous importance.
With the adjoint and tangent linear equations now introduced, let us
examine them more thoroughly in the next section.