Lecture Review: Second Order Methods for Nonsmooth Nonconvex Optimization
On August 5, Professor Andre Milzarek from the research group on numerical methods for AI applications gave a lecture on the topic: “Second order methods for nonsmooth nonconvex optimization”. Professor Milzarek is an Assistant Professor at the School of Data Science of The Chinese University of Hong Kong, Shenzhen and a member of AIRS and of the Shenzhen Research Institute of Big Data (SRIBD). Professor Michael Choi, (CUHK-SZ and AIRS), acted as the host of the lecture.
Professor Milzarek's talk focused on the discussion of deterministic and stochastic optimization approaches for structured optimization problems that utilize and exploit (stochastic or approximate) higher order information. The presentation started with an introduction and explanation of the underlying problem formulation. In his talk, Professor Milzarek considered a class of composite-type minimization problems where the objective function can be written as the sum of a smooth (but possibly nonconvex) and a convex (but possibly nonsmooth) function. The smooth part of the objective function typically corresponds to a loss model that measures the error between the given data and the obtained optimization solution. The nonsmooth part often plays the role of a regularization term that promotes a special structure, such as sparsity, group-sparsity, and low-rank or that allows to model constraints. A vast variety of applications can be covered by this problem formulation such as, e.g., large-scale machine learning problems, LASSO, sparse logistic regression, imaging problems, low-rank matrix completion, and dictionary learning.
Professor Milzarek then talked about the necessary background and the foundations of the second order approaches. The success of many nonsmooth optimization methods is based on the following observation: using the proximity operator of the nonsmooth part of the objective function, the first order necessary optimality conditions can be equivalently expressed as a nonsmooth equation. The principal idea is now to apply a nonsmooth variant of Newton's method – the so-called semismooth Newton method – to solve this equation and to accelerate the convergence and performance of existing first order schemes such as the proximal gradient method. The structure of a semismooth Newton step is very similar to a traditional Newton step for smooth problems. However, due to the lack of differentiability, a set of generalized derivatives has to be used and introduced. Under suitable assumptions, this procedure then generates a sequence of iterates that converges locally fast (q-superlinearly) to a stationary point of the problem.
Unfortunately, the semismooth Newton method has similar limitations as the classical Newton method. In particular, we can only guarantee local convergence if the iteration process is started sufficiently close to a solution or stationary point of the problem. In order to resolve this drawback, suitable globalization strategies need to be designed and developed. Professor Milzarek presented a new globalization mechanism and algorithm that is based on Robinson's normal map. The idea of the approach is to embed the semismooth Newton step for the normal map in a trust region framework to achieve and ensure global convergence. The analysis and algorithm utilizes a reduction ratio test on a new merit function to check the quality of the resulting higher order trust region steps. Combining these different techniques, favorable convergence results can be achieved:
-Every accumulation point of a sequence generated by the normal map-based trust region method is a stationary point.
-The Kurdyka-Lojasiewicz theory is applicable ensuring stronger convergence and rates for nonconvex problems.
-Transition to fast local convergence is possible.
Professor Milzarek then discussed and compared the performance of the method on a challenging nonconvex image compression task. In the application, we seek an optimal mask c that selects pixels from a given ground truth image. The mask should select as few pixels as possible to maximize the compression rate while maintaining a high reconstruction quality. An exemplary mask and recovered image are shown in Figure 1.
Figure 1: A nonconvex image compression task. On left the computed pixel mask c is shown. The mask has density 4.7%. The right image shows the reconstructed image when only using pixels according to the mask c.
A numerical comparison of the performance of the trust region approach (trssn) and of an inertial proximal gradient method (ipiano) is shown in Figure 2. The proposed higher order approach outperforms ipiano and recovers accurate masks and solutions using much less cpu-time.
Figure 2: Comparison of trssn and ipiano. The figure shows the change of the stationarity measure with respect to the required cpu-time.
In the second part of his talk, Professor Milzarek introduced a stochastic higher order extra-step scheme for the same class of composite-type problems. This stochastic approach is motivated by big data applications and large-scale learning tasks, where it is no longer tractable to evaluate the objective function and to compute the full gradient and Hessian. Instead, stochastic or inexact oracles, such as sub-sampling techniques, are used to approximate the gradient and the curvature information. The idea of the method is simple: we first perform a stochastic version of the semismooth Newton step and, in order to guarantee convergence, an additional stochastic proximal gradient step is computed. The extra-step method is very flexible and different step sizes, stochastic oracles, and higher order schemes can be utilized.
Next, Professor Milzarek presented global convergence results which mainly relied on a correct balance of the step sizes and the variance induced by the gradient approximations. It was shown that the generated stochastic process approaches stationarity almost surely and in expectation. Professor Milzarek then discussed extensions, possible combinations with variance reduction techniques, and how cheap stochastic quasi-Newton-type directions (that incorporate stochastic higher order information) can be generated. In the final part of the talk, various numerical results on sparse logistic regression and sparse deep learning were discussed. The numerical comparisons illustrate the strong performance of the proposed stochastic extra-step scheme as well as the efficiency, effectiveness, and potential of stochastic higher order information in large-scale learning tasks.
At the end of the lecture, Professor Milzarek interacted with the audience and answered several questions related to the sparse deep learning model and the performance of trssn and ipiano.