Quasi-Newton methods are clever tricks to speed up optimization without the heavy lifting of Newton's method. They approximate the inverse Hessian matrix, using info from previous steps to get better and better guesses.

BFGS and DFP are two popular Quasi-Newton methods. They use different formulas to update the approximation, but both aim for superlinear convergence. BFGS is generally more robust and efficient in practice.

Quasi-Newton Methods Motivation

Approximating the Inverse Hessian

  • Quasi-Newton methods iteratively approximate the inverse Hessian matrix without explicit computation
  • Address computational cost of Newton's method by maintaining and updating an approximation at each iteration
  • Utilize information from previous iterations to improve the approximation
  • Aim to achieve superlinear convergence (faster than first-order methods, typically slower than Newton's quadratic convergence)
  • Require only first-order derivative information (gradients) for computational efficiency in large-scale problems

Key Principles and Relationships

  • Secant condition requires the updated approximation to map the most recent step onto the observed change in gradient, mimicking the true Hessian along that step (written out after this list)
  • Like Newton's method, quasi-Newton methods build quadratic models of the objective function to determine search directions
  • The difference is that quasi-Newton methods use approximations where Newton's method uses exact Hessians
  • Examples of quasi-Newton methods include BFGS (Broyden-Fletcher-Goldfarb-Shanno) and DFP (Davidon-Fletcher-Powell)
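
For reference, the secant condition in the notation used throughout this guide, with $s_k$ the step and $y_k$ the gradient change:

$B_{k+1} s_k = y_k$, where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$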

BFGS and DFP Formulas

Derivation and Properties

  • BFGS update formula adds two rank-one corrections to the current Hessian approximation (a rank-two update)
  • Combines information from the current step and the change in gradient
  • DFP update formula has the same structure applied to the inverse Hessian approximation; the two formulas are duals, with the roles of the step and gradient-change vectors swapped
  • Both formulas expressed in terms of the current approximation, the step vector, and the difference in gradients between iterations
  • Sherman-Morrison-Woodbury formula crucial for efficient implementations
  • Allows direct updates of inverse Hessian approximation

Comparison of Update Formulas

  • BFGS and DFP updates satisfy secant condition
  • Maintain positive definiteness of the approximation matrix (provided the curvature condition $y_k^T s_k > 0$ holds)
  • Ensure descent directions in optimization
  • Differ in weighting of information from current iteration
  • Lead to different numerical properties and performance characteristics
  • BFGS update (Hessian approximation): $B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$
  • DFP update (inverse Hessian approximation): $H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}$, where $B_k$ and $H_k$ are the current Hessian and inverse Hessian approximations, $s_k$ is the step vector, and $y_k$ is the difference in gradients between iterations (a code sketch of both updates follows this list)
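
A minimal NumPy sketch of the two updates as written above: BFGS applied to the Hessian approximation $B_k$ and DFP applied to the inverse Hessian approximation $H_k$. The function names are illustrative, and both assume the curvature condition $y_k^T s_k > 0$ holds.

```python
import numpy as np

def bfgs_update_hessian(B, s, y):
    """BFGS update of the Hessian approximation B_k: two rank-one corrections."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def dfp_update_inverse(H, s, y):
    """DFP update of the inverse Hessian approximation H_k (the dual formula)."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```

A quick check of the secant condition: `bfgs_update_hessian(B, s, y) @ s` returns `y` (up to round-off), and `dfp_update_inverse(H, s, y) @ y` returns `s`.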

Implementing BFGS and DFP

Algorithm Structure

  • Initialize inverse Hessian approximation (typically identity matrix or scaled version)
  • Compute search direction: $p_k = -H_k \nabla f(x_k)$, where $H_k$ is the current inverse Hessian approximation
  • Perform a line search to determine the step size $\alpha_k$
  • Update solution: $x_{k+1} = x_k + \alpha_k p_k$
  • Update inverse Hessian approximation using BFGS or DFP formula
  • Efficient BFGS implementation uses Sherman-Morrison-Woodbury formula for direct updates
  • DFP update implemented directly on the inverse Hessian approximation (a sketch of the full loop follows this list)
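
A compact sketch of this loop, assuming both methods are run on the inverse Hessian approximation `H` and using a simple backtracking (Armijo) line search rather than a full Wolfe line search; all names and tolerances here are illustrative.

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, alpha=1.0, c=1e-4, tau=0.5):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    fx, gTp = f(x), grad(x) @ p
    while f(x + alpha * p) > fx + c * alpha * gTp:
        alpha *= tau
    return alpha

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse Hessian (Sherman-Morrison-Woodbury form)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def dfp_inverse_update(H, s, y):
    """DFP update applied directly to the inverse Hessian approximation."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

def quasi_newton(f, grad, x0, update, tol=1e-6, max_iter=200):
    """Generic quasi-Newton loop; `update` is either inverse-Hessian formula."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                      # initial inverse Hessian approximation
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:         # gradient-norm termination test
            break
        p = -H @ g                          # search direction p_k = -H_k grad f(x_k)
        alpha = backtracking_line_search(f, grad, x, p)
        s = alpha * p                       # step s_k = x_{k+1} - x_k
        x_new = x + s
        y = grad(x_new) - g                 # gradient difference y_k
        if y @ s > 1e-10:                   # update only when curvature condition holds
            H = update(H, s, y)
        x = x_new
    return x
```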

Implementation Considerations

  • Careful implementation of the line search (e.g., enforcing the Wolfe conditions) ensures sufficient decrease in the objective function
  • It also keeps $y_k^T s_k > 0$, which maintains positive definiteness of the approximation
  • Numerical stability safeguards guard against division by (near) zero, typically by skipping the update when $y_k^T s_k$ is too small
  • Maintaining symmetry of the approximation (e.g., re-symmetrizing after each update) makes implementations more robust
  • Termination criteria involve checking norm of gradient, change in function value, or step size against predefined tolerances
  • Example termination condition: $\|\nabla f(x_k)\| < \epsilon$ or $|f(x_k) - f(x_{k-1})| < \delta$ (see the snippet after this list)
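
A hedged illustration of these safeguards; the curvature threshold, the re-symmetrization step, and the tolerance names are typical choices rather than fixed standards.

```python
import numpy as np

def safe_update(H, s, y, update, curvature_tol=1e-10):
    """Apply a quasi-Newton update only when it is numerically safe."""
    if y @ s <= curvature_tol * np.linalg.norm(s) * np.linalg.norm(y):
        return H                       # skip: curvature condition (nearly) violated
    H_new = update(H, s, y)
    return 0.5 * (H_new + H_new.T)     # re-symmetrize to counter round-off drift

def converged(grad_x, f_curr, f_prev, step, eps=1e-6, delta=1e-9):
    """Typical termination tests: gradient norm, function change, and step size."""
    return (np.linalg.norm(grad_x) < eps
            or abs(f_curr - f_prev) < delta
            or np.linalg.norm(step) < delta)
```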

BFGS vs DFP Performance

Convergence Properties

  • BFGS and DFP exhibit superlinear convergence under certain conditions
  • BFGS converges faster and more reliably in practice
  • BFGS demonstrates better self-correcting properties
  • Recovers from poor approximations more effectively than DFP
  • Both methods sensitive to accuracy of line search
  • BFGS somewhat less sensitive than DFP

Practical Considerations

  • BFGS generally considered more robust and efficient than DFP
  • Particularly effective for problems with inaccurate line searches or ill-conditioned Hessians
  • Memory requirements similar for BFGS and DFP (O(n^2) storage for inverse Hessian approximation)
  • Numerical experiments show BFGS outperforms DFP in number of function evaluations and iterations to reach convergence
  • Example optimization problem: Rosenbrock function minimization (a comparison sketch follows this list)
  • BFGS typically requires fewer iterations (20-30) compared to DFP (30-40) for convergence
  • Relative performance varies depending on specific optimization problem
  • Rare cases exist where DFP may outperform BFGS (highly nonlinear problems with specific structures)
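
One way to see the practical difference is to run both updates on the Rosenbrock function (available as `scipy.optimize.rosen` and `rosen_der`), reusing the `quasi_newton` loop sketched earlier; exact iteration counts depend on the line search and starting point, so treat the figures above as typical rather than guaranteed.

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

# Assumes quasi_newton, bfgs_inverse_update, and dfp_inverse_update
# from the earlier sketch are in scope.
x0 = np.array([-1.2, 1.0])                  # classic Rosenbrock starting point
x_bfgs = quasi_newton(rosen, rosen_der, x0, bfgs_inverse_update)
x_dfp = quasi_newton(rosen, rosen_der, x0, dfp_inverse_update)
print("BFGS solution:", x_bfgs)             # both should approach the minimizer (1, 1)
print("DFP solution:", x_dfp)
```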

