Linear Optimisation

Lead Author: Jordan

Linear optimisation is the fundamental crux serving as a basis for most modern optimisation techniques, even for non-linear problems. In this section, I focus on the well known problem

\[\begin{equation} \mathcal{L}(\mathbf{w}) := ||\mathbf{y} - \mathbf{Xw}||_2^2, \end{equation}\]

which most commonly presents itself in Linear Regression where \(\mathbf{y}\in\mathbb{R}^N\) is the target response vector, \(\mathbf{X}\in\mathbb{R}^{N,M}\) is our observation or design matrix, and \(\mathbf{w}\in\mathbb{R}^{M}\) are our linear coefficients of interest. By differentiating and setting to 0, the solution can be analytically computed by solving

\[\begin{equation} \mathbf{X}^\text{T}\mathbf{Xw} = \mathbf{X}^\text{T}\mathbf{y}, \end{equation}\]

for \(\mathbf{w}\). This works well when our linear system of equations are over-determined i.e. \(N \gg M\) but for the case when we have a linear system of equations that is severely under-determined, it may be more efficient to solve

\[\begin{equation} \mathbf{XX}^\text{T}\mathbf{v} = \mathbf{y}, \end{equation}\]

for \(\mathbf{v}\) and then compute \(\mathbf{w} = \mathbf{X}^\text{T}\mathbf{v}\). Either way, we are looking to solve a system of linear equations of the form

\[\begin{equation} \mathbf{Ax} = \mathbf{b}, \end{equation}\]

for \(\mathbf{x}\) where \(\mathbf{A}\) is symmetric.

Note

Change of variables to match common notation in Mathematical literature.

When we have a large system of linear equations, we prefer to not directly invert \(\mathbf{A}\) and instead, resort to an iterative method and wish to minimise

(1)\[\phi(\mathbf{x}) := \frac{1}{2}\mathbf{x}^\text{T}\mathbf{Ax} - \mathbf{x}^\text{T}\mathbf{b}.\]

By differentiating Eq. 1 and setting to 0, we can show that the minimimum is achieved when we have found the solution to our original problem. We, again, constrain ourselves against computing this quantity using the analytic solution (as it may be computationally expensive), and instead seek an iterative method that updates our \(k\)-th guess of the solution. These class of methods are commonly known as linear gradient descent methods. For generality, they all have the following update rule:

\[\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_k\mathbf{p}_k\]

where \(\alpha_k\) is known as the learning rate and \(\mathbf{p}_k\) is the search direction. Below we state desirable properties we would like \(\mathbf{x}_k\) to have:

  • Existence.

    There exists some quantity \(\mathbf{x}_k\) such that \(||\mathbf{x}_k||_2^2 < \infty\) i.e. we do not want a solution that explodes. Why this is a good property to have becomes clearer when we consider equations in a physical context where \(\mathbf{x}_k\) may be bounded between reasonably sized numbers.

  • Uniqueness.

    Given \(\mathbf{x}_0\) is an arbitrary initialisation, for any \(\mathbf{x}_0\), \(\mathbf{x}_k \rightarrow \mathbf{x}_*\) as \(k \rightarrow \infty\) for some \(\mathbf{x}_*\).

  • Stability.

    Given an observation matrix \(\mathbf{A}\) and some response vector \(\mathbf{b}\), if we perturb \(\mathbf{A}\) and / or \(\mathbf{b}\) by some small noise i.e. use the data \(\mathbf{\tilde{A}}\) and \(\mathbf{\tilde{b}}\), where the tilde (~) notation represents a noisy representation of their respective non-noisy counter-parts, the solution \(\mathbf{\tilde{x}}_*\) should satisfy \(||\mathbf{\tilde{x}}_* - \mathbf{x}_*||_2^2 \le \epsilon\) for some reasonably small \(\epsilon\).

  • Monotonicity.

    We desire that with every iteration, we are closer to the true underlying solution i.e. \(||\mathbf{x}_k - \mathbf{x}_*||_2^2 < ||\mathbf{x}_{k-1} - \mathbf{x}_*||_2^2\ \forall\ k\in\{1, 2, 3,...\}\).

Here we show two different methods starting with the more naive approach.

Gradient Descent (Steepest Descent)

The naive approach. By setting the search direction as the gradient of Eq. 1, we can set a sufficiently small learning rate \(\alpha\) to update our \(k\)-th guess of \(\mathbf{x}_k\). By doing this there is no safe guarantee that we have monotonicity throughout the entire optimisation and if we set the learning rate too small, it may take far more iterations than required to converge. By using some simple Maths, we can compute the best learning rate at each iteration at every step of the minimisation process. See Gradient Descent for more details.

Conjugate Gradient

This is the most common linear optimiser in the literature with a much faster convergence than the previous methods. See Conjugate Gradient for more details.

Comparison

For illustration purposes, I have made a simple toy problem where the data and true solution are drawn from a Gaussian to show convergence rates for each of the described linear optimisers.

../../_images/comparison.png
 1from   ml_explained.jt.optimiser.linear import gradient_descent, conjugate_gradient
 2
 3import matplotlib.pyplot as plt
 4import seaborn           as sns
 5import numpy             as np
 6
 7sns.set_theme(style = 'whitegrid')
 8
 9# number of samples and feature dimension size
10N           = 100
11M           = 10
12
13# random number generator
14rng         = np.random.default_rng(0)
15
16# generate X from a standard Gaussian
17X           = rng.normal(size = (N, M))
18
19# generate target weights and response vector
20w           = rng.normal(size = M)
21y           = X @ w
22
23# define A and b for our optimisers
24A           = X.T @ X
25b           = X.T @ y
26x0          = np.zeros(M)
27
28# compute estimates of the target weight all starting from the 0-vector
29hat_gd      = gradient_descent(  A, b, x0, track = True, max_iter = 100, alpha = 1e-2)
30hat_sd      = gradient_descent(  A, b, x0, track = True, max_iter = 100, alpha = 'optimal')
31hat_cg      = conjugate_gradient(A, b, x0, track = True, max_iter = 100)
32
33# plot the cost function that is being minimised
34for hat, label in zip([hat_gd, hat_sd, hat_cg], ['gradient descent ($10^{-2}$)', 'gradient descent (optimal)', 'conjugate gradient']):
35    plt.semilogy(np.linalg.norm(hat @ A  - b, ord = 2, axis = 1) ** 2, label = label)
36
37# plot convergence threshold
38plt.hlines([1e-16] * 2, 0, 100, 'k', '--', label = 'threshold ($10^{-16}$)', zorder = -1)
39
40# aesthetics
41plt.grid(ls = (0, (5, 5)))
42plt.legend()
43plt.ylabel('$||\mathbf{b} - \mathbf{Ax}||_2^2$')
44plt.xlabel('No. Iterations')