A fundamental practical barrier to deploying personalized data-driven decision-making systems is the inability to credibly and safely evaluate performance on previously collected real-world data without live deployment. For example, in healthcare, an important goal is to learn from electronic health records, which are widely available, unlike costly randomized controlled trial data, in order to estimate and optimize the value of a personalized treatment policy. Unlike classical operational settings where decisions are made on objects with known dynamics, in human-centered systems such as e-commerce and healthcare, the personalized causal effects of actions are unknown and need to be statistically estimated. Although there has been great progress at the interface of causal inference and machine learning, incorporating these techniques requires certain assumptions that are widely known not to hold in practice. My work on credible methodology bridges the gap between theory and practice by empowering analysts and practitioners to optimize robust decisions or report bounds on inferential parameters under realistic and practitioner-tunable violations of assumptions. Credible performance assessment is also key for emerging applications in algorithmic fairness. Recent controversies about the use of machine learning for risk assessment in criminal justice, lending in financial services, and the provision of social services emphasize the importance of stronger individual- or subgroup-level performance guarantees, even as these same application areas pose practical barriers to measuring disparities and ensuring equitable performance.

More broadly, my research interests are at the interface of statistical machine learning and operations research. In my dissertation work, I also draw on and contribute to causal inference. My ultimate goal is to develop reliable, effective, credible, and equitable personalized data-driven decision-making so that machine learning can be deployed in important applications for beneficial impacts on firms, individuals, and society. To do so, I (1) identify common statistical structure and desiderata from challenging and motivating applications, (2) establish the theoretical foundations and performance guarantees for novel methodology, (3) develop practical algorithmic frameworks, and (4) articulate managerial insights to illustrate the relevance of my “effective and credible” lens for researchers and practitioners working in these motivating settings. Methodologically, I often design robust and credible estimators and algorithms under a unifying viewpoint of optimization under ambiguity, and I prove statistical convergence guarantees by leveraging optimization structure. Below, I describe my work in more detail along the parallel methodological threads of robust personalization from observational data and credible fairness and impact assessment.

Robust personalization from observational data ([2, 8, 6]). One line of work focuses on learning to improve personalized decisions from observational data in the presence of unobserved confounders. In “Minimax-Optimal Policy Learning Under Unobserved Confounding” (accepted at Management Science) [2], I provide a practical algorithmic framework and statistical guarantees.

Motivation and problem setting. Statistical confounding occurs when historical decisions depend on confounders that also affect the outcome.
As a result, naive estimation of causal effects on a historical dataset is conditional on the historical selection pattern, and causal effect estimates are biased. Methods from causal inference adjust for this selection bias due to observed confounders by assuming selection on observables, or unconfoundedness: that outcomes are conditionally independent of treatment assignment after adjusting for observed covariates. Unconfoundedness is the standard assumption justifying most causal adjustment methods, and nearly all recent progress at the intersection of causal inference and machine learning assumes it. However, unconfoundedness is often violated in practice by the very design of the operational environment. For example, in healthcare, physicians make decisions based on additional information that is not recorded, such as intuition about patient presentation. This is true more broadly of decisions made by experts, or of decisions that reflect prior optimization. In our paper, we illustrate the relevance of our approach with an extensive case study built on the Women’s Health Initiative, which ran a clinical trial, otherwise the gold standard for causal inference, in parallel with an observational study that was subject to unobserved confounding from self-selection into elective treatment. The observational study originally suggested that treatment might be beneficial for chronic disease prevention, while the clinical trial had to be halted early due to an increased incidence of deaths. A common explanation is that unobserved confounding led to overall healthier women enrolling in treatment in the observational study.

Contributions and challenges. In a setting with unobserved confounders, the data analyst observes only i.i.d. draws of covariate, treatment, and outcome data (X, T, Y), but the underlying data (U, X, T, Y) were generated with an additional unobserved confounder U that influences both treatment and outcome. If we had access to the full underlying data, it would suffice to learn P(T | X, U), the true probability of treatment assignment (the propensity score), in order to statistically adjust for selection bias via the likelihood ratio, e.g. by importance sampling. A data-driven approach recognizes that we may estimate P(T | X), which adjusts for some, but not all, confounding. I develop a nonparametric minimax approach to learn a machine learning policy with the best worst-case guarantee over an ambiguity set on the inverse propensity weights. The ambiguity set U restricts deviations of the true underlying inverse propensity weight W* = 1/P(T | X, U) from what we can estimate from observed data, W = 1/P(T | X). These restrictions are parametrized according to the marginal sensitivity model from causal inference, which translates into an interval uncertainty set for W* around W. The overall “size” of the set is governed by a scalar parameter Γ: the analyst specifies a plausible range for the extent of residual unobserved confounding. A personalized decision policy π(X) maps covariates to the probability of an action. For interpretability and generalization, we optimize personalized decision policies over a parametrized function class, such as the family of decision rules based on a linear index, π(X) = I[β⊤X > 0], or logistic assignment. When treatment is coded T ∈ {−1, +1}, the standard policy learning problem solves max_π E[Y T π W], e.g. by analogy to weighted classification.
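To make the ambiguity set concrete, the following is a minimal sketch (in Python; not the paper's implementation, and all variable names are illustrative) of the interval bounds that the marginal sensitivity model places on the unobserved weights W* for a given Γ, alongside the empirical analogue of the standard, confounding-naive objective E[Y T π W].

    import numpy as np

    def msm_weight_bounds(p_obs, gamma):
        # p_obs[i] is the estimated probability of unit i's observed treatment,
        # i.e. P(T = t_i | X_i); gamma >= 1 bounds how far the odds of treatment
        # given (X, U) may deviate multiplicatively from the odds given X alone.
        odds_inv = 1.0 / p_obs - 1.0            # (1 - p_obs) / p_obs
        lower = 1.0 + odds_inv / gamma          # smallest plausible W* = 1/P(T | X, U)
        upper = 1.0 + odds_inv * gamma          # largest plausible W*
        return lower, upper                     # gamma = 1 recovers W = 1/p_obs exactly

    def nominal_objective(y, t, pi, p_obs):
        # Empirical analogue of E[Y T pi W] with T coded in {-1, +1} and the
        # nominal weights W = 1/P(T | X), ignoring unobserved confounding.
        return np.mean(y * t * pi / p_obs)

    # toy usage
    rng = np.random.default_rng(0)
    p_obs = rng.uniform(0.2, 0.8, size=5)
    a, b = msm_weight_bounds(p_obs, gamma=1.5)  # elementwise a <= 1/p_obs <= b

The robust formulation described next replaces the single nominal weight vector with its worst case over this box.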
We solve a minimax problem, optimizing the robust worst-case regret relative to a baseline policy π0 over the ambiguity set, to find a confounding-robust policy:

    min_π max_{W ∈ U}  E[Y T (π − π0) W] / E[W]

The baseline policy π0 is simple, such as all-treat or all-control, and grounds our safety guarantees. We leverage the special linear-fractional optimization structure to develop a computationally efficient algorithmic framework for learning a robust policy. We take a robust gradient descent approach: for parametrized policies, we solve the inner optimization to full optimality via a ternary search procedure and evaluate gradients at the worst-case solution.

Our setting introduces technical challenges. In contrast to standard robust decision-making, unobserved confounders create ambiguity about both the underlying data-generating distribution and the realizations that appear in our dataset. Obtaining statistical convergence results for our robust machine-learned decision rule requires a problem-specific structural characterization of the optimal subproblem solution, or otherwise proving stability of the optimization problem. Estimated nuisance parameters further introduce perturbations to the left-hand-side constraints. Our theoretical guarantees recover the same rate of convergence as the previous non-robust approach while adding our novel safety/improvement guarantees: with high probability, the sample minimax-optimal policy deviates from the population minimax-optimal policy by a vanishing O_p(n^{-1/2}) term, where n is the number of confounded samples. Therefore, our results show that researchers, analysts, and practitioners can develop confounding-robust personalized decision policies at no great computational or statistical cost.

Follow-up work: infinite-horizon reinforcement learning. While minimax policy learning addresses the single-stage decision case, in healthcare there is increasing interest in managing chronic health conditions over time, such as insulin dosing for diabetic control, which requires offline reinforcement learning and off-policy evaluation in the sequential setting. However, data collected from electronic medical records in this setting is often observational and hence subject to unobserved confounding. In [6], we study robust policy evaluation in the infinite-horizon reinforcement learning setting. Instead of a direct generalization of the robust estimator approach used in [2], which would become “exponentially robust” in the time horizon, we build on a recently proposed estimating equation for the stationary distribution density ratio. Assuming i.i.d. unobserved confounders, we optimize robust bounds on an estimating equation for the density ratio of state occupancy measures.
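In both the single-stage and sequential settings, the key computational step is the inner worst-case optimization of a linear-fractional objective over interval-constrained weights. The following is a minimal illustrative sketch, not the authors' implementation, for the single-stage problem above, assuming the bounds [a_i, b_i] from the earlier sketch: at an optimum every weight sits at an endpoint, split by a single threshold on the per-unit scores, so plain enumeration over thresholds suffices (the ternary search mentioned above exploits unimodality in this threshold for additional speed).

    import numpy as np

    def worst_case_objective(c, a, b):
        # Maximize sum_i c_i W_i / sum_i W_i over a_i <= W_i <= b_i, where
        # c_i = Y_i T_i (pi(X_i) - pi0(X_i)) are the per-unit scores from the
        # robust objective above. A maximizer sets W_i = b_i exactly when c_i
        # exceeds the optimal value, so the optimum is attained at one of the
        # n + 1 threshold configurations in sorted order of c.
        order = np.argsort(-c)
        c, a, b = c[order], a[order], b[order]
        best = -np.inf
        for k in range(len(c) + 1):          # first k units at the upper bound
            w = np.concatenate([b[:k], a[k:]])
            best = max(best, np.dot(c, w) / np.sum(w))
        return best

For a parametrized policy class, the outer minimization then proceeds by gradient steps evaluated at the worst-case weights returned by this inner step.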
Credible impact evaluation in algorithmic fairness ([7, 3, 4, 5]). In another line of work, I develop robust, credible methodology for disparity assessment in the growing area of algorithmic fairness, and for impact assessment in consequential settings more broadly. A key question in fair data-driven decision-making is how to assess and trade off disparities in the performance of decision rules, in particular along the dimensions of a “protected attribute,” such as race or sex, on which discrimination is prohibited. I develop tools and crisp managerial insights to align the performance benchmarks by which we measure machine learning progress, and ultimately justify the deployment of algorithms, with the actual real-world operating conditions and impacts of algorithms in consequential settings.

In “Assessing Algorithmic Fairness with Unobserved Protected Class Using Data Combination” (accepted at Management Science) [7], we develop methodology for assessing bounds on disparities in settings where the protected attribute of interest is not actually recorded in the main dataset of outcomes, decisions, and covariates (Y, Ŷ, Z), but auxiliary data on the protected attribute and covariates (A, Z) are available. This is a common setting in practice, for example in financial services, where even regulators rely on proxy methods that estimate P(A | Z) but require untestable assumptions. Thus, disparities depend crucially on an unobserved conditional joint distribution, although we only have access to the observed marginal distributions. We take a partial identification approach and estimate the set of all disparities that are consistent with the observed data. Disparities are functionals that can be represented as marginal averages of the unknown P(Y, Ŷ | A, Z) with respect to the estimable measure P(A | Z). While P(Y, Ŷ | A, Z) is unknown, it satisfies the properties of any probability distribution, such as boundedness and the law of total probability, and its marginalization over A, P(Y, Ŷ | Z), is estimable. We therefore optimize over the ambiguous probability distribution P(Y, Ŷ | A, Z) subject to these constraints, which arise from the underlying probabilistic structure. A unifying optimization framework allows us to add further structural assumptions and to develop a general-purpose algorithm for recovering the set of disparities supported by the observed data in increasingly complex settings. We use a support function representation of the partial identification set in order to computationally represent the convex hull of the disparity set. Our work provides tools to assess the basic limits of what can be concluded from the data; these can be used either to verify that assessed disparities persist regardless of the assumptions imposed by any further estimation procedure, or to show that investment in further refinement of auxiliary data is necessary.
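As a simplified illustration of this partial identification approach (not the paper's general framework; the closed-form shortcut and all names below apply only to this special case), consider bounding the demographic disparity E[Ŷ | A = 1] − E[Ŷ | A = 0] for a binary protected attribute over discrete covariate cells. Per cell z, the main data identify P(Ŷ = 1 | z) and the auxiliary data identify P(A = 1 | z); the unknown joint probability is only constrained to a Fréchet interval, and because the disparity is linear in it with a fixed sign in every cell, the extremes of the identified set are attained at the interval endpoints.

    import numpy as np

    def demographic_disparity_bounds(mu, p, pz):
        # mu[z] = P(Yhat = 1 | Z = z) from the main data,
        # p[z]  = P(A = 1 | Z = z) from the auxiliary data,
        # pz[z] = P(Z = z), summing to one over discrete cells z.
        lo_q = np.maximum(0.0, mu + p - 1.0)   # Frechet lower bound on P(Yhat=1, A=1 | z)
        hi_q = np.minimum(mu, p)               # Frechet upper bound
        p1 = np.sum(p * pz)                    # P(A = 1)
        p0 = 1.0 - p1

        def disparity(q):                      # E[Yhat | A=1] - E[Yhat | A=0] given joint q
            return np.sum(q * pz) / p1 - np.sum((mu - q) * pz) / p0

        # the disparity is increasing in every q[z], so the extremes sit at the endpoints
        return disparity(lo_q), disparity(hi_q)

    # toy usage with three covariate cells
    mu = np.array([0.2, 0.5, 0.8])
    p = np.array([0.3, 0.6, 0.4])
    pz = np.array([0.5, 0.3, 0.2])
    lower, upper = demographic_disparity_bounds(mu, p, pz)

The support-function machinery described above generalizes this idea to richer disparity measures and constraint sets where no such closed form is available.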
Other fairness work. In [3], we show that in some consequential settings where fairness in machine learning has been of concern (lending, criminal justice, and social services), all machine learning methods are necessarily trained on data censored by previous decisions. A key insight is that not only does the standard machine learning assumption of representative data fail, but its failure is an inevitable consequence of these practical settings, and attempts to mitigate disparities could, under the guise of fairness, inadvertently continue to disadvantage those who were harmed by non-representation. Other work proposes and studies principled evaluation metrics that better align with stated stakeholder desiderata, as with the xAUC metric [5], in order to deliver crisp insights to practitioners in these areas, or develops robust evaluation for more challenging settings such as causal interventions [4]. These insights are directly targeted at researchers and practitioners; the xAUC metric was recently used in a large-scale study of fairness in clinical risk scores [9]. I am particularly interested in studying practical opportunities to improve reliability and outcomes from machine learning methods without relying on approaches that penalize model performance.

During my research internship with Miro Dudik and Jenn Wortman Vaughan, motivated by a study of what industry practitioners actually do to address model disparities [1], I studied, theoretically and empirically, when the common practical intervention of collecting more data can improve the fairness properties of regression models. I developed approximations to group-conditional finite-sample error that an analyst may use to guide the collection of additional data so as to reduce disparities most efficiently.

Ongoing and future research. In ongoing research, I am continuing work on off-policy learning in the sequential setting. The key insight is that, commonly in operations research, a subset of the state variable undergoes known dynamics, conditional on a (possibly contextual or covariate-conditional) effect. Therefore, the global state variable (e.g., inventory or resource consumption) introduces sequential dependence even though the contextual personalized effect is time-invariant. This setting bridges the single-stage and dynamic settings and highlights the interface of statistical machine learning and operations research: the structure of OR problems provides opportunities to leverage optimization structure in order to improve upon off-policy learning algorithms and results that rely solely on statistical estimation. In recently submitted work, I have developed a framework to study questions of algorithmic fairness in relation to personalized pricing. Personalized pricing can illustrate the benefits of greater personalization for expanding access by directing resources to those who “value” them the most, but many real-world challenges, including the joint covariance structure of groups and covariates or finite-sample uncertainty, can lead to inequities in the resulting allocations. An important generalization for further work is to study fairness in decision utility for contextual versions of classical stochastic optimization problems in operations research that newly incorporate machine-learned predictors.

For future research, I plan to continue my work on reliable, robust, and trustworthy machine learning more broadly, with attention to the interface of optimization and estimation for making better decisions from data. For example, jointly considering the robust bias guarantees of the sensitivity parameter alongside the estimation variance can lead to end-to-end guarantees that account for both ambiguity and finite-sample uncertainty. Other research directions continue to develop the interface between methodological theory and practical settings, and particularly benefit from the operations perspective. One direction recognizes that, as my work on fairness in pricing argues, in broader application settings short-term fairness or welfare considerations may be of interest because of longer-term considerations such as consumer retention or future value. It would be important to study this empirically, as well as to develop algorithmic personalization schemes that better manage long-term system performance. Another direction recognizes that, while my previous work considered the case of unobserved confounders, the opposite challenge for algorithmic decision support is understanding how “humans-in-the-loop” should use additional information available to decision-makers at the time of deployment.
This is a fundamental problem that limits the practical impact of algorithmic prescriptions, is not addressed by conventional theory, and relates to statistical bias-variance tradeoffs. Over the long term, I plan to continue tailoring statistical machine learning methodology to the challenges of emerging application areas.

References

[1] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2019.
[2] N. Kallus and A. Zhou. Minimax-optimal policy learning under unobserved confounding. Management Science (accepted); supersedes the NeurIPS 2018 version.
[3] N. Kallus and A. Zhou. Residual unfairness in fair machine learning from prejudiced data. In ICML, pages 2439–2448, 2018.
[4] N. Kallus and A. Zhou. Assessing disparate impact of personalized interventions: Identifiability and bounds. In NeurIPS, pages 3426–3437, 2019.
[5] N. Kallus and A. Zhou. The fairness of risk scores beyond classification: Bipartite ranking and the xAUC metric. In NeurIPS, pages 3438–3448, 2019.
[6] N. Kallus and A. Zhou. Confounding-robust policy evaluation in infinite-horizon reinforcement learning. In NeurIPS, 2020.
[7] N. Kallus, X. Mao, and A. Zhou. Assessing algorithmic fairness with unobserved protected class using data combination. Management Science (accepted); a preliminary version appeared as an extended abstract at FAccT 2020.
[8] N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under unobserved confounding. In AISTATS, pages 2281–2290, 2019.
[9] S. R. Pfohl, A. Foryciarz, and N. H. Shah. An empirical characterization of fair machine learning for clinical risk prediction. arXiv preprint arXiv:2007.10306, 2020.