A fundamental practical barrier to deploying personalized data-driven decision-making
systems is the inability to credibly and safely evaluate performance on previously collected
real-world data without live deployment. For example, in healthcare, an important goal is
to learn from electronic health records, which are widely available, unlike costly randomized-controlled-trial
data, in order to estimate and optimize the value of a personalized treatment
policy.
policy. Unlike classical operational settings where decisions are made on objects with known
dynamics, in human-centered systems such as e-commerce and healthcare, the personalized
causal effects of actions are unknown and need to be statistically estimated. Although
there has been great progress at the interface of causal inference and machine learning,
incorporating these techniques requires certain assumptions that are widely known to not
hold in practice. My work on credible methodology bridges the gap between theory
and practice by empowering analysts and practitioners to optimize robust decisions or
report bounds on inferential parameters under realistic and practitioner-tunable violations
of assumptions. Credible performance assessment is also key for emerging applications
in algorithmic fairness. Recent controversies about the use of machine learning for risk
assessment in criminal justice, lending in financial services, and the provision of social
services emphasize the importance of stronger individual- or subgroup-level performance
guarantees, although these same application areas pose practical barriers to measuring
disparities and ensuring equitable performance.
More broadly, my research interests are at the interface of statistical machine learning
and operations research. In my dissertation work, I also draw on and contribute to causal
inference. My ultimate goals are to develop reliable, effective, credible, and equitable
personalized data-driven decision-making so that machine learning can be deployed in
important applications for beneficial impacts on firms, individuals, and society. To do so,
I (1) identify common statistical structure and desiderata from challenging and motivating
applications, (2) establish the theoretical foundations and performance guarantees for novel
methodology, (3) develop practical algorithmic frameworks, and (4) articulate managerial
insights to illustrate the relevance of my “effective and credible” lens for researchers and
practitioners working in these motivating settings. Methodologically, I often design robust
and credible estimators and algorithms under a unifying viewpoint of optimization under
ambiguity and prove statistical convergence guarantees by leveraging optimization structure.
Below, I describe my work in more detail along the parallel methodological threads of
robust personalization from observational data and credible fairness and impact assessment.
Robust personalization from observational data ([2, 8, 6]). One line of work focuses
on learning to improve personalized decisions from observational data in the presence of
unobserved confounders. In “Minimax-Optimal Policy Learning Under Unobserved Confounding”
(accepted at Management Science) [2], I provide a practical algorithmic framework and
statistical theoretical guarantees.
Motivation and problem setting. Statistical confounding occurs when historical
decisions depend on confounders that also affect the outcome. As a result, naive estimation
of causal effects from a historical dataset conditions on the historical selection pattern and is
therefore biased. Methods from causal inference adjust for this selection bias
due to observed confounders by assuming selection on observables or unconfoundedness: that
outcomes are conditionally independent of treatment assignment upon adjusting for observed
covariates. Unconfoundedness is the standard justification for most causal inference adjustment
methods, and nearly all recent progress at the interface of causal inference and machine
learning has assumed it.
However, unconfoundedness is often violated in practice by design of the operational
environment. For example, in healthcare, physicians make decisions based on additional
information that is not recorded, such as intuition about patient presentation. The same is true
more broadly of decisions made by experts, or of decisions that reflect prior optimization.
In our paper, we illustrate the relevance of our approach with an extensive case study
built on the Women’s Health Initiative, which comprised both a parallel observational study,
subject to unobserved confounding from self-selection into elective treatment, and a clinical
trial, which would otherwise be the gold standard for causal inference. The observational
study originally suggested that treatment might be beneficial for chronic disease prevention,
while the clinical trial had to be halted early due to an increased incidence of deaths. A
common explanation is that unobserved confounding led to overall healthier women enrolling
in the treatment arm of the observational study.
Contributions and challenges. In a setting with unobserved confounders, the data
analyst only observes i.i.d. draws from the joint distribution of covariate, treatment, and
outcome data, (X, T, Y), for the observed treatments, but the underlying data, (U, X, T, Y),
was generated with an additional unobserved confounder, U, which influences both treatment
and outcome. If we had access to the full underlying data, it would be sufficient to learn
P(T | X, U), the true probability of treatment assignment (the propensity score), in order to
statistically adjust for selection bias via the likelihood ratio, e.g. by importance sampling.
A data-driven approach recognizes that we may estimate P(T | X), which adjusts for
some but not all confounding. I develop a nonparametric minimax approach to learn a
machine learning policy with the best worst-case guarantee over an ambiguity set on the
inverse propensity weights. The ambiguity set U restricts deviations of the true underlying
inverse propensity weight W* = 1/P(T | X, U) from what we can estimate from observed data,
W = 1/P(T | X). These restrictions are parametrized according to the marginal sensitivity
model from causal inference, which translates into an interval uncertainty set for W* with
respect to W. The overall “size” of the set is governed by a scalar parameter, Γ: the
analyst specifies plausible bounds on the extent of residual unobserved
confounding.
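To make the interval concrete: a bound Γ ≥ 1 on the odds ratio between the true propensity P(T | X, U) and the nominal propensity P(T | X) implies W* ∈ [1 + (W − 1)/Γ, 1 + Γ(W − 1)]. A minimal sketch of this map (function and variable names are illustrative, not from our released code):

```python
def msm_weight_interval(e_hat, gamma):
    """Interval for the true inverse propensity weight W* = 1/P(T | X, U)
    under the marginal sensitivity model with parameter gamma >= 1.

    The model bounds the odds ratio between P(T | X, U) and the nominal
    propensity e_hat = P(T | X) within [1/gamma, gamma]; since
    W - 1 = (1 - e)/e, this scales W - 1 by at most gamma in either direction.
    """
    w = 1.0 / e_hat  # nominal inverse propensity weight
    return 1.0 + (w - 1.0) / gamma, 1.0 + gamma * (w - 1.0)
```

At Γ = 1 the interval collapses to the nominal weight, recovering the unconfounded setting.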
A personalized decision policy π(X) maps covariates to the probability of an action.
For the purposes of interpretability and generalization, we optimize personalized decision
policies over a parametrized function class such as the family of decision rules based on a
linear index, π(X) = I[β⊤X > 0], or logistic assignment. When the treatment contrast
is coded T ∈ {−1, +1}, the standard policy learning problem solves max_π E[Y T π(X) W], e.g.
by analogy to weighted classification. We solve a minimax problem, optimizing the
worst-case regret relative to a baseline policy π0 over the ambiguity set, to find a
confounding-robust policy:

    min_π max_{W ∈ U} E[Y T (π(X) − π0(X)) W] / E[W]
The baseline policy is simple, such as all-treat or all-control, and improves our safety
guarantees. We leverage the special linear-fractional optimization structure to develop a
computationally efficient algorithmic framework for learning a robust policy. We take a robust
gradient descent approach: for parametrized policies, we solve the inner optimization to full
optimality via a ternary search procedure and evaluate gradients at the worst-case solution.
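To sketch the inner step: with per-sample terms r_i = Y_i T_i (π(X_i) − π0(X_i)) and weight intervals [a_i, b_i] implied by the sensitivity model, the adversary maximizes Σ w_i r_i / Σ w_i over the box. The optimum sits at a threshold vertex (upper bounds above a cutoff in r, lower bounds below), so a sort-and-scan solves it exactly; our ternary search exploits the same structure. A hypothetical sketch, not our released implementation:

```python
import numpy as np

def worst_case_value(r, w_lo, w_hi):
    """Maximize sum(w * r) / sum(w) subject to w_lo <= w <= w_hi.

    For this linear-fractional program over a box, some optimum sets
    w_i = w_hi[i] for the largest values of r_i and w_i = w_lo[i] for
    the rest, so scanning the n + 1 splits of the sorted order is exact.
    """
    r, w_lo, w_hi = map(np.asarray, (r, w_lo, w_hi))
    order = np.argsort(-r)                    # sort r in descending order
    r, w_lo, w_hi = r[order], w_lo[order], w_hi[order]
    best = -np.inf
    for k in range(len(r) + 1):               # first k weights at their upper bound
        w = np.concatenate([w_hi[:k], w_lo[k:]])
        best = max(best, float(np.dot(w, r) / np.sum(w)))
    return best
```

The robust-gradient step then evaluates the policy gradient at the maximizing weights rather than at the nominal ones.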
Our setting introduces technical challenges. In contrast to robust decision-making,
unobserved confounders result in ambiguity about both the underlying generating distribution
and the realizations that appear in our dataset. Obtaining statistical convergence results for
our robust machine-learning decision rule requires a problem-specialized structural characterization
of the optimal subproblem solution, or otherwise proving stability of the optimization problem.
Estimated nuisance parameters introduce perturbations to left-hand side constraints.
Our theoretical guarantees recover the same rate of convergence as the previous non-robust
approach, together with our novel safety/improvement guarantees: with high probability, the
sample-minimax optimal policy deviates from the population-minimax optimal policy by
a vanishing O_p(n^(−1/2)) term, where n is the number of confounded samples. Therefore, our
results show that researchers, analysts, and practitioners can develop confounding-robust
personalized decision policies at no great computational or statistical cost.
Follow-up work: infinite-horizon reinforcement learning. While minimax policy
learning studies the single-stage decision case, in healthcare there is increasing interest in
managing chronic health conditions over time, such as insulin dosing in diabetic control,
which requires offline reinforcement learning and off-policy evaluation in the sequential
setting. However, data collected from electronic medical records in this setting is often
observational and hence subject to unobserved confounding. We study robust
policy evaluation in [6] in the infinite horizon reinforcement learning setting. Instead of
a direct generalization of the robust estimator approach used in [2], which would become
“exponentially robust” in the time horizon, we build on a recently proposed estimating
equation for the stationary distribution density ratio. Assuming i.i.d. unobserved confounders,
we optimize robust bounds on an estimating equation for the density ratio of state occupancy
measures.
Credible impact evaluation in algorithmic fairness ([7, 3, 4, 5]). In another line of
work, I develop robust, credible methodology for disparity assessment for the growing area
of algorithmic fairness and impact assessment in consequential settings more broadly. A
key question in fair data-driven decision-making is how to assess and trade off disparities in the
performance of decision rules, in particular along the dimensions of a “protected attribute”
upon which discrimination is prohibited, such as race or sex. I develop tools and crisp
managerial insights to align the performance benchmarks, by which we measure machine
learning progress and ultimately justify the deployment of algorithms, with the actual
real-world operating conditions and impacts of algorithms in consequential settings.
In “Assessing Disparities with Unobserved Protected Class” (accepted at Management
Science) [7], we develop methodology for assessing bounds on disparities in settings where the
protected attribute of interest is not actually recorded in the dataset of decision recommendations,
decisions, and covariates, (Y, Ŷ, Z), but auxiliary data on the protected attribute and covariates,
(A, Z), is available. This is a common setting in practice, such as in financial services,
where even regulators rely on proxy methods that estimate P(A | Z) but require untestable
assumptions. Thus, disparities depend crucially on the unobserved conditional joint distribution,
although we only have access to the observed marginal distributions. We take a partial
identification approach and estimate the sets that correspond to all possible disparities that
are consistent with the observed data. Disparities are functionals that can be represented as
marginal averages of the unknown P(Y, Ŷ | A, Z) with respect to the estimable measure
P(A | Z). While P(Y, Ŷ | A, Z) is an unknown quantity, it satisfies the properties of all
probability distributions, such as boundedness and the law of total probability, and its
marginalization over A, P(Y, Ŷ | Z), is estimable. We therefore optimize over the ambiguous
probability distribution P(Y, Ŷ | A, Z) subject to these constraints, which arise from the
underlying probability distribution structure. A unifying optimization framework allows for
adding additional structural assumptions and developing a general-purpose algorithm for
recovering the set of disparities which are supported by the observed data in increasingly
complex settings. We use a support function representation of the partial identification set
in order to computationally represent the convex hull of the disparity set. Our work provides
tools to assess the basic limits of what can be concluded from the data, and can be used
either to verify that assessed disparities persist regardless of the assumptions induced by any
further estimation procedure, or to conclude that investment in further refinement of auxiliary
data is necessary.
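As a minimal illustration of the simplest instance: with a binary decision Ŷ, a binary protected attribute A, and discrete covariates Z, only boundedness and the law of total probability constrain the unknown per-cell joint, giving Fréchet-Hoeffding intervals, and the demographic disparity E[Ŷ | A = 1] − E[Ŷ | A = 0] is increasing in each cell's joint, so its identified set is an interval attained at the cell-wise extremes. (Our support-function machinery covers far more general disparities; all names below are illustrative.)

```python
import numpy as np

def demographic_disparity_bounds(p_z, p_yhat_z, p_a_z):
    """Partial-identification bounds on E[Yhat|A=1] - E[Yhat|A=0] when the
    joint of (Yhat, A) is unobserved but the per-cell marginals
    p_yhat_z = P(Yhat=1 | Z=z) and p_a_z = P(A=1 | Z=z) are known.

    Within each covariate cell z, q_z = P(Yhat=1, A=1 | z) is constrained
    only by the Frechet-Hoeffding bounds; the disparity is increasing in
    every q_z, so its extremes are hit at the cell-wise interval endpoints.
    """
    p_z, p_yhat_z, p_a_z = map(np.asarray, (p_z, p_yhat_z, p_a_z))
    p1 = float(np.dot(p_z, p_a_z))   # P(A=1), identified from the marginals
    p0 = 1.0 - p1                    # P(A=0)
    q_lo = np.maximum(0.0, p_yhat_z + p_a_z - 1.0)   # Frechet lower bound
    q_hi = np.minimum(p_yhat_z, p_a_z)               # Frechet upper bound

    def disparity(q):
        e1 = np.dot(p_z, q) / p1                 # E[Yhat | A=1]
        e0 = np.dot(p_z, p_yhat_z - q) / p0      # E[Yhat | A=0]
        return float(e1 - e0)

    return disparity(q_lo), disparity(q_hi)
```

Adding structural assumptions (e.g. smoothness or monotonicity) tightens the per-cell intervals and hence the identified set.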
Other fairness work. In [3], we show that in some consequential settings where fairness
in machine learning has been of concern – lending, criminal justice, and social services – all
machine learning methods are necessarily trained on data censored by previous decisions. A
key insight is that not only does the standard machine learning assumption of representative
data fail, but its failure is an inevitable consequence of these practical settings, and attempts to
mitigate disparities could inadvertently continue to disadvantage those who were harmed by
non-representation under the guise of fairness. Other work proposes and studies principled
evaluation metrics that better align with stated stakeholder desiderata as in the xAUC
metric [5] to deliver crisp insights to inform practitioners in these areas, or develops robust
evaluation for more challenging settings such as causal interventions [4]. These insights are
directly targeted towards researchers and practitioners; the xAUC metric was recently used
in a large-scale study of fairness in clinical risk scores [9].
I am particularly interested in studying practical opportunities to improve reliability
and outcomes from machine learning methods without relying on approaches that penalize
model performance. During my research internship with Miro Dudik and Jenn Wortman
Vaughan, motivated by a study of what business practitioners actually do to improve model
disparities [1], I developed theory for, and empirically studied, when the common practical
intervention of collecting more data may improve the fairness properties of regression models.
To this end, I developed approximations to group-conditional finite-sample error that an
analyst may use to guide the collection of additional data where it improves disparities most
efficiently.
Ongoing and future research: In ongoing research, I am continuing work on off-policy
learning in the sequential setting. The key insight is that commonly in operations research, a
subset of the state variable undergoes known dynamics, conditional on a (possibly contextual
or covariate-conditional) effect. Therefore the global state variable (e.g. inventory or resource
consumption) introduces sequential dependence although the contextual personalized effect
is time-invariant. This setting bridges the single-stage and dynamic settings and highlights
the interface of statistical machine learning and operations research: the structure of OR
problems provides opportunities to leverage optimization structure in order to improve upon
off-policy learning algorithms and results that rely solely on statistical estimation.
In recently submitted work, I have developed a framework to study questions of algorithmic
fairness in relation to personalized pricing. The setting of personalized pricing can illustrate
the benefits of greater personalization for expanding access by directing resources to those
who “value” them the most, but many real-world challenges, including the joint covariance
structure of groups and covariates and finite-sample uncertainty, can lead to inequities in
the resulting allocations. An important generalization for further work is studying the question of
fairness in decision utility of the contextual version of classical operations research stochastic
optimization problems which newly incorporate machine-learned predictors.
For future research, I plan on continuing my research on reliable, robust, and trustworthy
machine learning more broadly, with attention to the interface of optimization and estimation
for making better decisions from data. For example, jointly considering the robust bias
guarantees of the sensitivity parameter alongside the estimation variance can lead to
end-to-end guarantees that account for both ambiguity and finite-sample uncertainty.
Other research directions continue to develop the interface between methodological theory
and practical settings and particularly benefit from the operations perspective. One research
direction recognizes that, as my work in fairness in pricing argues, in broader application
settings, short-term fairness or welfare considerations may be of interest because of
longer-term factors such as consumer retention or future value. It would be important to
study this empirically as well as develop algorithmic personalization schemes that better
manage long-term system performance. Another research direction recognizes that while
my previous work considered the case of unobserved confounders, the opposite challenge
for algorithmic decision support is understanding how “humans-in-the-loop” should use
additional information available to decision-makers at the time of deployment. This is a
fundamental problem that limits the practical impact of prescriptions from algorithms in
practice, is not considered by conventional theory, and relates to statistical bias-variance
tradeoffs.
Over the long term, I plan on continuing to tailor statistical machine learning methodological
development to the challenges of emerging application areas.
References
[1] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach. Improving
fairness in machine learning systems: What do industry practitioners need? In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages
1–16, 2019.
[2] N. Kallus and A. Zhou. Minimax-optimal policy learning under unobserved confounding.
Management Science (accepted); supersedes the NeurIPS 2018 version.
[3] N. Kallus and A. Zhou. Residual unfairness in fair machine learning from prejudiced
data. In ICML, pages 2439–2448, 2018.
[4] N. Kallus and A. Zhou. Assessing disparate impact of personalized interventions:
Identifiability and bounds. In NeurIPS, pages 3426–3437, 2019.
[5] N. Kallus and A. Zhou. The fairness of risk scores beyond classification: Bipartite ranking
and the xAUC metric. In NeurIPS, pages 3438–3448, 2019.
[6] N. Kallus and A. Zhou. Confounding-robust policy evaluation in infinite-horizon
reinforcement learning. In NeurIPS, 2020.
[7] N. Kallus, X. Mao, and A. Zhou. Assessing algorithmic fairness with unobserved
protected class using data combination. Management Science (accepted). A preliminary
version appeared at FAccT 2020 as an extended abstract.
[8] N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects
under unobserved confounding. In AISTATS, pages 2281–2290, 2019.
[9] S. R. Pfohl, A. Foryciarz, and N. H. Shah. An empirical characterization of fair machine
learning for clinical risk prediction. arXiv preprint arXiv:2007.10306, 2020.