Conjugate Gradient
Lead Author: Jordan
The method of conjugate gradient utilises past search directions when selecting the next search direction. At the \(k\)-th iteration not only do we know the current gradient \(\mathbf{r}_{k-1}\), we know all the previous gradients \(\{\mathbf{r}_0,...,\mathbf{r}_{k-2}\}\). By utilising this information we can search in some orthogonal space and converge much quicker than the method of steepest descent.
The general idea behind this algorithm is: since \(\alpha_k := \mathbf{p}_k^\text{T}\mathbf{r}_{k-1}\ /\ ||\mathbf{p}_k||_\mathbf{A}^2\), we are looking for a search direction \(\mathbf{p}_k\) where \(\mathbf{p}_k \neq \mathbf{r}_{k-1}\), as in the case of steepest descent, and \(\mathbf{p}_k^\text{T}\mathbf{r}_{k-1} \neq 0\). We want \(\mathbf{p}_k\) and \(\mathbf{x}_k\) to satisfy the following conditions:
\(\mathbf{p}_1,...,\mathbf{p}_k\) should be linearly independent.
\(\phi(\mathbf{x}_k) = \min_{\mathbf{x}\in\mathbf{x}_0 + \text{span}\{\mathbf{p}_1,...,\mathbf{p}_k\}}\phi(\mathbf{x})\).
\(\mathbf{x}_k\) can be calculated easily from \(\mathbf{x}_{k-1}\).
Consider the iterative update equation for \(\mathbf{x}_k\):
where \(\mathbf{P}_{k-1} = [\mathbf{p}_1,...,\mathbf{p}_{k-1}]\) with parameters \(\mathbf{y}_k\) and \(\alpha_k\). The objective is to determine the parameters \(\mathbf{y}_k\) and \(\alpha_k\):
We tried to separate the \(\textcolor{blue}{\mathbf{y}_k}\) and \(\textcolor{red}{\alpha}\) terms in our calculations but have a mixed middle term. If we did not have this mixed term in the middle, we could just minimise over the two variables separately. Hence, we choose \(\mathbf{p}_k\) such that:
by employing the Gram-Schmidt orthonormalisation process
and we are left with the following minimisation task:
The first minimisation problem is solved by \(\mathbf{y} = \mathbf{y}_{k-1}\) (this has already been calculated in the previous step) then \(\mathbf{x}_k=\mathbf{x}_0+\mathbf{P}_{k}\mathbf{y}_k\) satisfies
By completing the square, we can optimally compute \(\alpha_k\) for any search direction \(\mathbf{p}_k\) giving the result \(\alpha_k=\mathbf{p}_k^\text{T}\mathbf{r}_0/||\mathbf{p}_k||_\mathbf{A}^2=||\mathbf{r}_{k-1}||_2^2/||\mathbf{p}_k||_\mathbf{A}^2\). We can show this by considering
Further, it can be shown that \(\mathbf{r}_i^\text{T}\mathbf{r}_j = 0,\ \forall i\neq j\) as \(\mathbf{p}_i\) and \(\mathbf{r}_i\) span the same Krylov subspace i.e. \(\mathbf{r}_i\) form the orthogonal basis with respect to the standard inner product whilst \(\mathbf{p}_i\) form the orthogonal basis with respect to the inner product induced by \(\mathbf{A}\).
\begin{algorithm}
\caption{Conjugate Gradient}
\begin{algorithmic}
\PROCEDURE{ConjugateGradient}{$\mathbf{A}, \mathbf{b}, \mathbf{x}_0, \tau, K$}
\STATE $\mathbf{r}_0 = \mathbf{b} - \mathbf{Ax}_0$
\STATE $\mathbf{p}_1 = \mathbf{r}_0$
\FOR{$k = 1$ \TO $K$}
\IF{$||\mathbf{r}_{k-1}||_2 \le \tau$}
\BREAK
\ENDIF
\STATE $\alpha_k = ||\mathbf{r}_{k-1}||_2^2 / ||\mathbf{p}_{k}||_\mathbf{A}^2$
\STATE $\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_k\mathbf{p}_k$
\STATE $\mathbf{r}_k = \mathbf{r}_{k-1} - \alpha_k\mathbf{Ap}_{k}$
\STATE $\beta_k = ||\mathbf{r}_k||_2^2 / ||\mathbf{r}_{k-1}||_2^2$
\STATE $\mathbf{p}_{k+1} = \mathbf{r}_k + \beta_k\mathbf{p}_{k}$
\ENDFOR
\RETURN $\mathbf{x}_k$
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
Convergence
The conjugate gradient method has the following convergence rate
Note
The complete derivation is a little complex and lengthy so see the following for more details:
Polyak, Boris (1987). Introduction to Optimization, p68-74
Hackbusch, W. (2016). Iterative Solution of Large Sparse Systems of Equations (2nd ed.), p238-243