Research Statement
Lijun Ding, Cornell University
October 19, 2020
My work lies at the intersection of optimization, statistics, and machine learning, where I work on solving large-scale,
high-dimensional optimization problems. My research is broadly motivated by the technological trend of
"high-dimensional but structured data": instead of a small amount of accurate data, we now face huge amounts of data
that are often indirect and incomplete, with more variables and features than samples, yet carrying special
structure, e.g., sparsity and low-rankness. This trend has produced large-scale optimization problems for decision
making, statistical inference, and prediction. However, these problems are traditionally considered intractable
due to (i) the high computational cost of handling the data and the optimization variables, and (ii) the
potential ill-conditioning and NP-hardness arising from nonconvexity in the modeling and from noisy or corrupted
data-generating processes.
My work resolves the above issues by (i) justifying theoretically the good conditioning, implicit regularization,
and benign geometry of these large-scale optimization problems under mild and standard statistical assumptions,
and (ii) developing robust, computationally efficient, and theoretically sound algorithms that learn and exploit
these special structures to solve these large-scale problems. To achieve both, my research brings together
fundamental ideas and sophisticated tools from a broad spectrum of areas, including high-dimensional statistics,
convex and nonconvex optimization, and statistical learning theory. This work can have far-reaching impact on
both practitioners and theoreticians: (i) for practitioners, this new set of optimization algorithms allows for faster
yet robust decision making and prediction; (ii) for theoreticians, the lens of statistics and probability brings new
insight to the fundamental study of optimization as applied to large-scale problems.
In what follows, I first present my prior work on solving large-scale optimization problems, including classical
convex problems such as semidefinite programming and newly arising nonconvex problems, based on ideas and
techniques such as Frank-Wolfe methods, strict complementarity, and the leave-one-out argument. I then
present my plans for several important further directions: (i) bridging convex and nonconvex optimization
algorithmically and theoretically; (ii) efficient sampling and optimization in reinforcement learning.
1 Prior work
1.1 Semidefinite programming: regularity, conditioning, and efficient algorithms
Semidefinite programs (SDPs) form a class of convex optimization problems with remarkable modeling power. Their
applications range from combinatorial problems such as max-cut, optics problems such as phase retrieval, and graph
problems such as the stochastic block model, to modern machine learning problems such as matrix completion
and robust PCA. In all these applications, the SDP enjoys the best theoretical guarantees once it is solved
exactly. However, SDPs are challenging to solve because they involve a matrix variable X ∈ R^{n×n} whose dimension
n can rise into the millions or billions. Most existing algorithms suffer from both the high storage cost, Ω(n²), of
the matrix variable and an extraordinarily high time complexity, Ω(n³). Such complexity forbids their use
in modern large-scale problems. Worse, even if we could run these algorithms, it is not clear that these
problems are well-conditioned or regular enough for the traditional guarantees on the objective value to be
translated into guarantees on metrics such as distance to the solution, which matter more in SDP applications.
My research closes the above gaps by (i) demonstrating that the traditionally considered regularity conditions do
hold in these applications [DU20], and (ii) designing storage- and time-efficient algorithms for solving SDPs [DYC+19, DG20,
DFU20], which converge under strong duality alone and speed up to linear convergence when the further regularity
conditions are satisfied. Hence they are well suited for solving large-scale SDPs.
More precisely, in [DU20] I demonstrate that the traditionally considered regularity conditions, such
as strict complementarity and non-degeneracy of SDPs, are satisfied for Z2 synchronization, the stochastic block model, and
matrix completion. Hence, accuracy measured in objective value does translate to the distance metric, as shown
in my work [DU]. In [DYC+19], I designed a new storage-optimal (and, to the best of my knowledge, potentially the
first such) algorithm that provably solves SDPs at a sublinear convergence rate, requiring only one eigenvector
computation per iteration and the solution of one small-scale SDP with a decision variable of size r, where r is the rank of
the optimal solution, which is expected to be much smaller than n in most applications. In the follow-up works
[DG20, DFU20], I further refined the algorithm by incorporating ideas from the classical spectral bundle method and
Frank-Wolfe, so that it keeps the sublinear rate yet achieves a linear convergence rate once the regularity conditions
are satisfied.
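To make the one-eigenvector-per-iteration idea concrete, the following is a minimal numpy sketch of a Frank-Wolfe step over the spectrahedron {X ⪰ 0, tr(X) = α}, on a toy least-squares objective. It is an illustration of the general technique, not the algorithm of [DYC+19]; the objective, step-size schedule, and iteration count are illustrative choices.

```python
import numpy as np

def frank_wolfe_sdp(M, alpha, iters=300):
    """Minimize ||X - M||_F^2 over the spectrahedron {X PSD, tr(X) = alpha}.

    Each Frank-Wolfe step needs only one extreme eigenvector: the linear
    minimization oracle over the spectrahedron returns a rank-one matrix.
    """
    n = M.shape[0]
    X = np.zeros((n, n))  # the first step (gamma = 1) lands in the feasible set
    for t in range(iters):
        grad = 2.0 * (X - M)
        # argmin_S <grad, S> over the spectrahedron is alpha * v v^T,
        # where v is the eigenvector for the smallest eigenvalue of grad.
        w, V = np.linalg.eigh(grad)
        v = V[:, 0]
        S = alpha * np.outer(v, v)
        gamma = 2.0 / (t + 2.0)  # classical Frank-Wolfe step size
        X = (1 - gamma) * X + gamma * S
    return X
```

Note that the iterate is a convex combination of rank-one atoms, which is what makes low-storage representations possible when the optimal solution has small rank.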
1.2 Statistical nonconvex optimization: analysis framework and efficient algorithms
Modern optimization problems are usually nonconvex, either by modeling choice or by the nature of the underlying
mechanism. One important approach to solving these problems is convex relaxation, which has achieved great
theoretical success, in terms of finding the underlying signal in the signal-processing regime and of generalization
power in the statistics and machine learning regime, once the convex relaxation is solved exactly. However, most
algorithms for solving the relaxations are observed to be slow in general: they usually enjoy only a
sublinear rate of convergence, in both theory and practice. Researchers have therefore turned to directly solving nonconvex
low-rank optimization problems (i.e., optimization problems with rank constraints) via simple gradient-based algorithms. Despite the
remarkable empirical success of these simple algorithms, there is a huge gap between this practice and the theory
of nonconvex algorithms: the simple algorithms lack theoretical guarantees, whereas existing algorithms with
guarantees are often highly artificial and too complicated to implement in practice.
To close this gap, I have proposed a framework for analyzing and explaining the fast convergence of simple gradient-based
algorithms. Applying this framework allowed me to solve an open problem posed back in 2009 [JMD10].
Moreover, inspired by the analysis of these simple algorithms, I have designed efficient, simple-to-implement, and
provably linearly/quadratically convergent algorithms. I explain this line of work in more detail in the
paragraphs that follow.
Stochastic iteration procedure analysis framework Based on a novel technique, the leave-one-out technique,
Professor Chen and I [DC20] introduced a general and powerful framework for analyzing iterative stochastic
procedures arising in nonconvex algorithms. This framework enabled us to solve an open problem from 2009
[JMD10] by establishing the convergence of singular value projection (SVP), a simple gradient-based algorithm for
low-rank matrix completion; in fact, we show that SVP attains a linear convergence rate. The
power of our framework is not limited to simple gradient algorithms; it also applies to other stochastic
iterative procedures. Moreover, via this framework we also show that the sample complexity of a convex relaxation of
low-rank matrix completion is near optimal, which is the best result in the literature.
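For concreteness, the SVP iteration is simple to state: a gradient step on the observed entries followed by projection onto the set of rank-r matrices via a truncated SVD. The sketch below is a minimal illustration, not the analyzed setting of [DC20]; the step size η ≈ 1/p for sampling rate p is the choice commonly used in practice, and the problem sizes are arbitrary.

```python
import numpy as np

def svp_completion(M_obs, mask, r, eta, iters=100):
    """Singular Value Projection for matrix completion (illustrative sketch).

    M_obs: observed entries (zeros elsewhere); mask: boolean observation pattern.
    Each iteration: gradient step on the observed residual, then the best
    rank-r approximation via truncated SVD.
    """
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        G = mask * (M_obs - X)  # negative gradient, supported on observed entries
        U, s, Vt = np.linalg.svd(X + eta * G, full_matrices=False)
        X = (U[:, :r] * s[:r]) @ Vt[:r]  # projection onto rank-r matrices
    return X
```

On well-conditioned random instances with enough observations, the iterates converge rapidly to the underlying low-rank matrix, consistent with the linear rate established in [DC20].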
Convex optimization based algorithms Inspired by the analysis of gradient-based algorithms for nonconvex
problems, one of my works [CCD+19] developed algorithms that in practice need to solve only a few (three
to five) small-scale convex optimization problems, each not much costlier than a couple
of iterations of a gradient-based algorithm. In a statistical setting, using tools from random matrix theory,
variational analysis, and spectral theory, we show that our algorithm is almost free of tuning and achieves quadratic
convergence when initialized properly. Empirically, the method not only converges fast but can also be initialized
randomly when the sample size is reasonably large.
Averaged projected gradient method (AVPG) Having established algorithms well suited to quadratic
or norm-based objectives with nonconvex constraints, I further advanced my research to non-quadratic objectives, where
familiar concepts such as restricted isometry no longer hold, and standard nonconvex approaches such as gradient
descent may perform badly in practice and have no satisfactory theory guaranteeing global and efficient
convergence. I demonstrate that the critical component in dealing with non-quadratic objectives is a regularity projection
oracle, which enables global and linear convergence of our proposed algorithm, AVPG [DZC20]. Our results
apply to a wide range of non-quadratic problems, including rank aggregation, one-bit matrix sensing/completion, and,
more broadly, generalized linear models with rank constraints.
2 Future Research
2.1 Bridging convex and nonconvex optimization: algorithm and theory
Having explored both the convex and nonconvex worlds, I want to leverage the insights and algorithms from convex
optimization and use them in solving nonconvex optimization problems.
Algorithmically, I would like to design a method that gets the best of both worlds, even in the setting of
limited observations or weak signals: (1) it generates an output that generalizes both theoretically and empirically
(like some convex methods); (2) it is simple to implement and runs very fast (like some nonconvex methods).
To achieve this, I propose a hybrid approach that combines convex relaxation with nonconvex methods, using approximate
solutions of convex relaxations to initialize algorithms for statistical nonconvex problems.
The rationale behind this hybrid approach is twofold. First, in terms of generalizability, there is hope that in
various important settings the approximate solution is close enough to the true signal to be a good initialization
point for nonconvex algorithms. Second, the proposed approach enjoys low computational cost because neither
phase, the initialization phase nor the nonconvex-algorithm phase, consumes much computational budget.
Indeed, in the initialization phase, although a highly accurate optimal solution is usually unaffordable,
approximate solutions can be obtained at reasonable cost, e.g., via an early stopping rule for iterative algorithms. In
the second phase, many nonconvex algorithms are very fast in practice and in theory when initialized
appropriately.
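A schematic sketch of this two-phase pipeline, on rank-one PSD matrix sensing, is given below. Everything here is an illustrative assumption rather than a published result: phase 1 runs a few early-stopped projected gradient steps on the convex PSD relaxation, and phase 2 runs simple factored (Burer-Monteiro style) gradient descent from that approximate solution. The measurement model, step sizes, and iteration counts are toy choices.

```python
import numpy as np

def hybrid_recover(A, y, r=1, cvx_iters=10, ncvx_iters=500):
    """Two-phase sketch: early-stopped convex phase, then nonconvex refinement.

    A: stack of m symmetric sensing matrices, shape (m, n, n); y_i = <A_i, X*>.
    """
    m, n, _ = A.shape

    def grad_full(X):
        # Gradient of (1/2m) * sum_i (<A_i, X> - y_i)^2.
        resid = np.tensordot(A, X, axes=([1, 2], [0, 1])) - y
        return np.tensordot(resid, A, axes=(0, 0)) / m

    # Phase 1: projected gradient on the convex PSD relaxation, stopped early,
    # so it is cheap and only needs to produce a decent approximate solution.
    X = np.zeros((n, n))
    for _ in range(cvx_iters):
        X = X - 0.25 * grad_full(X)
        w, V = np.linalg.eigh(X)                    # project onto the PSD cone
        X = (V * np.clip(w, 0, None)) @ V.T

    # Initialize the low-rank factor from the approximate convex solution.
    w, V = np.linalg.eigh(X)
    U = V[:, -r:] * np.sqrt(np.clip(w[-r:], 0, None))

    # Phase 2: plain gradient descent on the factored objective
    # (1/2m) * sum_i (<A_i, U U^T> - y_i)^2; the chain-rule factor of 2
    # is absorbed into the step size.
    for _ in range(ncvx_iters):
        resid = np.tensordot(A, U @ U.T, axes=([1, 2], [0, 1])) - y
        G = np.tensordot(resid, A, axes=(0, 0)) / m
        U = U - 0.1 * (G @ U)
    return U
```

The point of the sketch is the division of labor: the convex phase supplies a statistically sound initialization, and the nonconvex phase supplies cheap, fast local refinement.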
Theoretically, I want to understand when effective nonconvex algorithms exist given that we already know the
convex relaxation admits generalizable solutions. The rationale is that the theory of convex relaxation is
well developed, while analyzing nonconvex problems usually requires lengthy, technical, and somewhat ad-hoc
arguments. More importantly, this understanding should help us build intuitive and robust algorithms for previously unexplored
nonconvex problems. I would like to further explore this direction building on my previous work, which has identified a
few conditions and algorithmic oracles that ensure the transfer of knowledge from convex to nonconvex
optimization, such as the regularity projection oracle [DZC20] and the sharpness and growth conditions [DU].
2.2 Efficient sampling and optimization in reinforcement learning
Many real-world problems are best modeled by reinforcement learning, where an agent sequentially makes
decisions and receives rewards from the environment so as to maximize the cumulative reward. The
nonconvexity comes from the functional approximation of the value function or the parametrization of the policy
space. The convergence of algorithms such as policy gradient and Q-learning, and the landscape of the objective (the
cumulative reward), are complicated by the functional approximation error and the intricate dependency along
the data trajectory. Nevertheless, these heuristics have achieved remarkable empirical success in games, robotics, and
even traditional operations research areas such as scheduling. My plan is to understand the intricate statistical
dependence and the geometry of the landscape via certain low-rank structures of the Q-function, the transition
function, or the reward function, and further to design efficient sampling and optimization schemes.
References
[CCD+19] Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, and Dmitriy Drusvyatskiy.
Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv
preprint arXiv:1904.10020, 2019.
[DC20] Lijun Ding and Yudong Chen. Leave-one-out approach for matrix completion: Primal and dual analysis.
IEEE Transactions on Information Theory, 2020.
[DFU20] Lijun Ding, Jicong Fan, and Madeleine Udell. kFW: A Frank-Wolfe style algorithm with stronger subproblem
oracles. arXiv preprint arXiv:2006.16142, 2020.
[DG20] Lijun Ding and Benjamin Grimmer. Revisiting spectral bundle methods: Primal-dual (sub)linear convergence
rates. arXiv preprint arXiv:2008.07067, 2020.
[DU] Lijun Ding and Madeleine Udell. A strict complementary slackness approach to growth, error bound, and
sensitivity of solution in conic programming. To appear soon.
[DU20] Lijun Ding and Madeleine Udell. On the regularity and conditioning of low rank semidefinite programs.
arXiv preprint arXiv:2002.10673, 2020.
[DYC+19] Lijun Ding, Alp Yurtsever, Volkan Cevher, Joel A Tropp, and Madeleine Udell. An optimal-storage
approach to semidefinite programming using approximate complementarity. arXiv preprint
arXiv:1902.03373, 2019.
[DZC20] Lijun Ding, Yuqian Zhang, and Yudong Chen. Low-rank matrix recovery with non-quadratic loss: projected
gradient method and regularity projection oracle. arXiv preprint arXiv:2008.13777, 2020.
[JMD10] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular value
projection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.