Conjugate Gradient

Lead Author: Jordan

The method of conjugate gradient utilises past search directions when selecting the next search direction. At the \(k\)-th iteration not only do we know the current gradient \(\mathbf{r}_{k-1}\), we know all the previous gradients \(\{\mathbf{r}_0,...,\mathbf{r}_{k-2}\}\). By utilising this information we can search in some orthogonal space and converge much quicker than the method of steepest descent.

The general idea behind this algorithm is: since \(\alpha_k := \mathbf{p}_k^\text{T}\mathbf{r}_{k-1}\ /\ ||\mathbf{p}_k||_\mathbf{A}^2\), we are looking for a search direction \(\mathbf{p}_k\) where \(\mathbf{p}_k \neq \mathbf{r}_{k-1}\), as in the case of steepest descent, and \(\mathbf{p}_k^\text{T}\mathbf{r}_{k-1} \neq 0\). We want \(\mathbf{p}_k\) and \(\mathbf{x}_k\) to satisfy the following conditions:

  • \(\mathbf{p}_1,...,\mathbf{p}_k\) should be linearly independent.

  • \(\phi(\mathbf{x}_k) = \min_{\mathbf{x}\in\mathbf{x}_0 + \text{span}\{\mathbf{p}_1,...,\mathbf{p}_k\}}\phi(\mathbf{x})\).

  • \(\mathbf{x}_k\) can be calculated easily from \(\mathbf{x}_{k-1}\).

Consider the iterative update equation for \(\mathbf{x}_k\):

\[\begin{split}\begin{align} \mathbf{x}_1 &= \mathbf{x}_0 + \alpha_1\mathbf{p}_1\\ \mathbf{x}_2 &= \mathbf{x}_1 + \alpha_2\mathbf{p}_2 = \mathbf{x}_0 + \alpha_1\mathbf{p}_1 + \alpha_2\mathbf{p}_2\\ \vdots\ \ &\\ \mathbf{x}_k &= \mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k + \alpha_k\mathbf{p}_k \end{align}\end{split}\]

where \(\mathbf{P}_{k-1} = [\mathbf{p}_1,...,\mathbf{p}_{k-1}]\) with parameters \(\mathbf{y}_k\) and \(\alpha_k\). The objective is to determine the parameters \(\mathbf{y}_k\) and \(\alpha_k\):

\[\begin{split}\begin{align} \phi(\mathbf{x}_k) &= \frac{1}{2}(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k + \alpha_k\mathbf{p}_k)^\text{T}\mathbf{A}(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k + \alpha_k\mathbf{p}_k) - (\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k + \alpha_k\mathbf{p}_k)^\text{T}\mathbf{b}\\ &= \phi(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k) + \alpha_k\mathbf{p}_k^\text{T}\mathbf{A}(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k) - \alpha_k\mathbf{p}_k^\text{T}\mathbf{b} + \frac{\alpha_k^2}{2}||\mathbf{p}_k||_\mathbf{A}^2\\ &= \textcolor{blue}{\phi(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_k)} + \alpha_k\mathbf{p}_k^\text{T}\mathbf{A}\mathbf{P}_{k-1}\mathbf{y}_k + \textcolor{red}{\frac{\alpha_k^2}{2}||\mathbf{p}_k||_\mathbf{A}^2 - \alpha_k\mathbf{p}_k^\text{T}\mathbf{r}_0},\quad \text{as } \mathbf{b} - \mathbf{A}\mathbf{x}_0 = \mathbf{r}_0 \end{align}\end{split}\]

We tried to separate the \(\textcolor{blue}{\mathbf{y}_k}\) and \(\textcolor{red}{\alpha}\) terms in our calculations but have a mixed middle term. If we did not have this mixed term in the middle, we could just minimise over the two variables separately. Hence, we choose \(\mathbf{p}_k\) such that:

\[\mathbf{p}_k^\text{T}\mathbf{A}\mathbf{P}_{k-1} = \mathbf{0},\]

by employing the Gram-Schmidt orthonormalisation process

\[\mathbf{p}_k = \mathbf{r}_{k - 1} - \sum_{j = 1}^{k - 1} \frac{\langle\mathbf{r}_{k - 1},\ \mathbf{p}_j\rangle_\mathbf{A}}{||\mathbf{p}_j||_\mathbf{A}^2}\mathbf{p}_j,\]

and we are left with the following minimisation task:

\[\min_{\mathbf{x}_k\in\mathbf{x}_0+\text{span}\{\mathbf{p}_1,...,\mathbf{p}_k\}}\phi(\mathbf{x}_k) = \min_{\mathbf{y}}\big(\phi\left(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}\right)\big) + \min_{\alpha_k}\left(\frac{\alpha_k^2}{2}||\mathbf{p}_k||_\mathbf{A}^2 - \alpha_k\mathbf{p}_k^\text{T}\mathbf{r}_0\right)\]

The first minimisation problem is solved by \(\mathbf{y} = \mathbf{y}_{k-1}\) (this has already been calculated in the previous step) then \(\mathbf{x}_k=\mathbf{x}_0+\mathbf{P}_{k}\mathbf{y}_k\) satisfies

\[\phi(\mathbf{x}_k) = \min_{\mathbf{x}_0 + \text{span}\{\mathbf{p}_1,...,\mathbf{p}_k\}} \phi(\mathbf{x})\]

By completing the square, we can optimally compute \(\alpha_k\) for any search direction \(\mathbf{p}_k\) giving the result \(\alpha_k=\mathbf{p}_k^\text{T}\mathbf{r}_0/||\mathbf{p}_k||_\mathbf{A}^2=||\mathbf{r}_{k-1}||_2^2/||\mathbf{p}_k||_\mathbf{A}^2\). We can show this by considering

\[\begin{split}\begin{align} \mathbf{p}_k^\text{T}\mathbf{r}_{k-1} &= \mathbf{p}_k^\text{T}(\mathbf{b} - \mathbf{Ax}_{k-1})\\ &= \mathbf{p}_k^\text{T}(\mathbf{b} - \mathbf{A}(\mathbf{x}_0 + \mathbf{P}_{k-1}\mathbf{y}_{k-1}))\\ &= \mathbf{p}_k^\text{T}\mathbf{r}_0 - \mathbf{p}_k^\text{T}\mathbf{AP}_{k-1}\mathbf{y}_{k-1}\\ &= \mathbf{p}_k^\text{T}\mathbf{r}_0 \end{align}\end{split}\]

Further, it can be shown that \(\mathbf{r}_i^\text{T}\mathbf{r}_j = 0,\ \forall i\neq j\) as \(\mathbf{p}_i\) and \(\mathbf{r}_i\) span the same Krylov subspace i.e. \(\mathbf{r}_i\) form the orthogonal basis with respect to the standard inner product whilst \(\mathbf{p}_i\) form the orthogonal basis with respect to the inner product induced by \(\mathbf{A}\).

             \begin{algorithm}
 \caption{Conjugate Gradient}
 \begin{algorithmic}
 \PROCEDURE{ConjugateGradient}{$\mathbf{A}, \mathbf{b}, \mathbf{x}_0, \tau, K$}
 \STATE $\mathbf{r}_0 = \mathbf{b} - \mathbf{Ax}_0$
 \STATE $\mathbf{p}_1 = \mathbf{r}_0$
 \FOR{$k = 1$ \TO $K$}
     \IF{$||\mathbf{r}_{k-1}||_2 \le \tau$}
         \BREAK
     \ENDIF
     \STATE $\alpha_k = ||\mathbf{r}_{k-1}||_2^2 / ||\mathbf{p}_{k}||_\mathbf{A}^2$
     \STATE $\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_k\mathbf{p}_k$
     \STATE $\mathbf{r}_k = \mathbf{r}_{k-1} - \alpha_k\mathbf{Ap}_{k}$
     \STATE $\beta_k = ||\mathbf{r}_k||_2^2 / ||\mathbf{r}_{k-1}||_2^2$
     \STATE $\mathbf{p}_{k+1} = \mathbf{r}_k + \beta_k\mathbf{p}_{k}$
 \ENDFOR
 \RETURN $\mathbf{x}_k$
 \ENDPROCEDURE
 \end{algorithmic}
 \end{algorithm}
        

Convergence

The conjugate gradient method has the following convergence rate

\[||\mathbf{e}||_\mathbf{A}^2 \leq 2\left(\frac{\sqrt{K_2(\mathbf{A})} - 1}{\sqrt{K_2(\mathbf{A})} + 1}\right)^k||\mathbf{e}_0||_\mathbf{A}^2 = 2\left(1 - \frac{2}{\sqrt{K_2(\mathbf{A})} + 1}\right)^k||\mathbf{e}_0||_\mathbf{A}^2.\]

Note

The complete derivation is a little complex and lengthy so see the following for more details:

  • Polyak, Boris (1987). Introduction to Optimization, p68-74

  • Hackbusch, W. (2016). Iterative Solution of Large Sparse Systems of Equations (2nd ed.), p238-243