Figure 1: Overview of the setting of unsupervised domain adaptation and its difference with the standard setting of supervised learning. In domain adaptation the source (training) domain is related to but different from the target (testing) domain. During training, the algorithm can only have access to labeled samples from source domain and unlabeled samples from target domain. The goal is to generalize on the target domain.
One of the backbone assumptions underpinning the generalization theory of supervised learning algorithms is that the test distribution should be the same as the training distribution. However, in many real-world applications it is often time-consuming or even infeasible to collect labeled data from all the possible scenarios where our learning system is going to be deployed. For example, consider a typical vehicle-counting application, where we would like to count how many cars there are in a given image captured by a camera. There are over 200 cameras at different locations in Manhattan, with different calibrations, perspectives, lighting conditions, etc. In this case, it is very costly to collect labeled images from all the cameras. Ideally, we would collect labeled images for only a subset of the cameras and still be able to train a counting system that works well for all of them.
Domain adaptation deals with the setting where we only have access to labeled data from the training distribution (a.k.a. the source domain) and unlabeled data from the testing distribution (a.k.a. the target domain). The setting is complicated by the fact that the source domain can be different from the target domain, just as in the example above, where images taken by different cameras usually have different pixel distributions due to different perspectives, lighting conditions, calibrations, etc. The goal of an adaptation algorithm is then to generalize to the target domain without seeing labeled samples from it.
In this blog post, we will first review a common technique to achieve this goal based on the idea of finding a domain-invariant representation. Then we will construct a simple example to show that this technique alone does not necessarily lead to good generalization on the target domain. To understand the failure mode, we give a generalization upper bound that decomposes into terms measuring the difference in input and label distributions between the source and target domains. Crucially, this bound allows us to provide a sufficient condition for good generalization on the target domain. We also complement the generalization upper bound with an information-theoretic lower bound that characterizes the trade-off in learning domain-invariant representations. Intuitively, this result says that when the marginal label distributions differ across domains, one cannot hope to simultaneously minimize both the source and target errors by learning invariant representations; this provides a necessary condition for the success of methods based on learning invariant representations. All the material presented here is based on our recent work published at ICML 2019.
The central idea behind learning invariant representations is quite simple and intuitive: we want to find a representation that is insensitive to the domain shift while still capturing rich information for the target task. Such a representation would allow us to generalize to the target domain by only training with data from the source domain. The pipeline for learning domain-invariant representations is illustrated in Figure 3.
Note that in the framework above we can use different transformation functions \(g_S/g_T\) on the source/target domain to align the distributions. This powerful framework is also very flexible: by using different measures to align the feature distributions, we recover several of the existing approaches, e.g., DANN (Ganin et al., 2015), DAN (Long et al., 2015) and WDGRL (Shen et al., 2018).
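To make the pipeline concrete, here is a minimal numerical sketch (not the implementation of any specific published method): we measure how far apart two one-dimensional feature distributions are via the distance between their sample means, i.e., MMD with a linear kernel, and check that domain-specific shifts \(g_S/g_T\) can drive this discrepancy to zero. The distributions and shifts below are made-up for illustration.

```python
import random

def mean_discrepancy(feats_s, feats_t):
    """Absolute difference of sample means (MMD with a linear kernel)."""
    mu_s = sum(feats_s) / len(feats_s)
    mu_t = sum(feats_t) / len(feats_t)
    return abs(mu_s - mu_t)

random.seed(0)
# Toy 1-D domains: source centered at 0, target centered at 3.
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
target = [random.gauss(3.0, 1.0) for _ in range(2000)]

# Raw inputs: the domains are far apart (discrepancy close to 3).
print(mean_discrepancy(source, target))

# Domain-specific transformations g_S(x) = x + 1.5, g_T(x) = x - 1.5
# align the feature distributions (discrepancy close to 0).
print(mean_discrepancy([x + 1.5 for x in source], [x - 1.5 for x in target]))
```

Methods such as DANN, DAN and WDGRL follow the same recipe but learn \(g\) with neural networks and replace the linear-kernel MMD with, respectively, an adversarial domain classifier, a multi-kernel MMD, and a Wasserstein critic.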
A theoretical justification for the above framework is the following generalization bound by Ben-David et al. (2010): Let \(\mathcal{H}\) be a hypothesis class and \(\mathcal{D}_S/\mathcal{D}_T\) be the marginal data distributions of the source/target domains, respectively. For any \(h\in\mathcal{H}\), the following generalization bound holds: $$\varepsilon_T(h) \leq \varepsilon_S(h) + d(\mathcal{D}_S, \mathcal{D}_T) + \lambda^*,$$ where \(\lambda^* = \inf_{h\in\mathcal{H}} \varepsilon_S(h) + \varepsilon_T(h)\) is the optimal joint error achievable on both domains. At a colloquial level, the above generalization bound shows that the target risk is essentially controlled by three terms: the source risk \(\varepsilon_S(h)\), the distance \(d(\mathcal{D}_S, \mathcal{D}_T)\) between the marginal data distributions of the two domains, and the optimal joint error \(\lambda^*\).
The interpretation of the bound is as follows. If there exists a hypothesis that works well on both domains, then in order to minimize the target risk, one should choose a hypothesis that minimizes the source risk while at the same time aligning the source and target data distributions.
The above framework for domain adaptation has generated a surge of interest in recent years, and we have seen many interesting variants and applications based on the general idea of learning domain-invariant representations. Yet it is not clear whether such methods are guaranteed to succeed when the following two conditions are met: (1) the feature transformation \(g\) aligns the source and target distributions in the feature space, and (2) the hypothesis \(h\) achieves a small risk on the source domain.
Since we can only train with labeled data from the source domain, ideally we would hope that when the above two conditions are met, the composition function \(h\circ g\) also achieves a small risk on the target domain because these two domains are close to each other in the feature space. Perhaps somewhat surprisingly, this is not the case, as we demonstrate with the following simple example illustrated in Figure 4.
Consider an adaptation problem where we have input space and feature space \(\mathcal{X} = \mathcal{Z} = \mathbb{R}\) with source domain \(\mathcal{D}_S = U(-1,0)\) and target domain \(\mathcal{D}_T = U(1,2)\), respectively, where we use \(U(a,b)\) to mean a uniform distribution in the interval \((a, b)\). In this example, the two domains are so far away from each other that their supports are disjoint! Now, let’s try to align them so that they are closer to each other. We can do this by shifting the source domain to the right by one unit and then shifting the target domain to the left by one unit.
As shown in Figure 4, after adaptation both domains have distribution \(U(0, 1)\), i.e., they are perfectly aligned by our simple translation transformation. However, due to our construction, now the labels are flipped between the two domains: for every \(x\in (0, 1)\), exactly one of the domains has label 1 and the other has label 0. This implies that if a hypothesis achieves perfect classification on the source domain, it will also incur the maximum risk of 1 on the target domain! In fact, in this case we have \(\varepsilon_S(h) + \varepsilon_T(h) = 1\) after adaptation for any classifier \(h\). As a comparison, before adaptation, a simple interval hypothesis \(h^*(x) = 1\) iff \(x\in (-1/2, 3/2)\) attains perfect classification on both domains.
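This counter-example is easy to check numerically. The sketch below is a simulation, with labeling functions chosen to match the construction above: \(h^*\) is perfect on both domains before adaptation, while after the shifts every hypothesis has source and target errors summing to (approximately, on finite samples) 1.

```python
import random

# Labeling functions consistent with the construction: h*(x) = 1 iff
# x in (-1/2, 3/2) classifies both domains perfectly.
f_S = lambda x: 1 if x > -0.5 else 0  # labels on the source support (-1, 0)
f_T = lambda x: 1 if x < 1.5 else 0   # labels on the target support (1, 2)

def risk(h, xs, f):
    """Empirical 0-1 risk of hypothesis h against labeling function f."""
    return sum(h(x) != f(x) for x in xs) / len(xs)

random.seed(0)
xs_s = [random.uniform(-1.0, 0.0) for _ in range(20000)]  # D_S = U(-1, 0)
xs_t = [random.uniform(1.0, 2.0) for _ in range(20000)]   # D_T = U(1, 2)

# Before adaptation: the interval hypothesis is perfect on both domains.
h_star = lambda x: 1 if -0.5 < x < 1.5 else 0
print(risk(h_star, xs_s, f_S), risk(h_star, xs_t, f_T))  # 0.0 0.0

# After adaptation (g_S(x) = x + 1, g_T(x) = x - 1) both feature
# distributions are U(0, 1), but the labels flip across domains, so for
# any h the two errors sum to roughly 1.
zs_s = [(x + 1, f_S(x)) for x in xs_s]
zs_t = [(x - 1, f_T(x)) for x in xs_t]
for h in (lambda z: 1 if z > 0.5 else 0, lambda z: 0, lambda z: 1):
    total = (sum(h(z) != y for z, y in zs_s) / len(zs_s)
             + sum(h(z) != y for z, y in zs_t) / len(zs_t))
    print(round(total, 2))  # close to 1.0 for every hypothesis
```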
So what insights can we gain from the previous counter-example? Why do we incur a large target error despite perfectly aligning the marginal distributions of the two domains and minimizing the source error? Does this contradict Ben-David et al.’s generalization bound?
The caveat here is that while the distance between the two domains becomes 0 after the adaptation, the optimal joint error on both domains becomes large. In the counter-example above, this means that after adaptation \(\lambda^{*} = 1\), which further implies \(\varepsilon_T(h) = 1\) if \(\varepsilon_S(h) = 0\). Intuitively, from Figure 4 we can see that the labeling functions of the two domains are “maximally different” from each other after adaptation, but during adaptation we are only aligning the marginal distributions in the feature space. Since the optimal joint error \(\lambda^*\) is often unknown and intractable to compute, could we construct a generalization upper bound that is free of the constant \(\lambda^*\) and takes into account the conditional shift?
Here is an informal description of what we show in our paper: Let \(f_S\) and \(f_T\) be the labeling functions of the source and target domains. Then for any hypothesis class \(\mathcal{H}\) and any \(h\in\mathcal{H}\), the following inequality holds: $$\varepsilon_T(h) \leq \varepsilon_S(h) + d(\mathcal{D}_S, \mathcal{D}_T) + \min\{\mathbb{E}_S[|f_S - f_T|], \mathbb{E}_T[|f_S - f_T|]\}.$$
Roughly speaking, the above bound gives a decomposition of the difference of errors between the source and target domains. Again, the second term on the RHS measures the difference between the marginal data distributions. But, in place of the optimal joint error term, the third term now measures the discrepancy between the labeling functions of the two domains. Hence, this bound says that aligning the marginal data distributions alone is not sufficient for adaptation; we also need to ensure that the labeling functions (conditional distributions) are close to each other after adaptation.
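In the counter-example, this new third term makes the failure visible. A quick numerical sketch, using the labeling functions induced on the shared feature space \(U(0, 1)\) after the naive shifts:

```python
import random

# After the shifts g_S(x) = x + 1 and g_T(x) = x - 1, both feature
# distributions are U(0, 1), but the induced labeling functions
# disagree at every point of (0, 1):
f_S_feat = lambda z: 1 if z > 0.5 else 0  # source labels in feature space
f_T_feat = lambda z: 1 if z < 0.5 else 0  # target labels in feature space

random.seed(0)
zs = [random.uniform(0.0, 1.0) for _ in range(10000)]
disc = sum(abs(f_S_feat(z) - f_T_feat(z)) for z in zs) / len(zs)
print(disc)  # 1.0: the labeling-function term is maximal
```

With \(d(\mathcal{D}_S, \mathcal{D}_T) = 0\) but the labeling-function discrepancy equal to 1, the bound correctly refuses to promise a small target error, in contrast to what marginal alignment alone would suggest.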
In the counter-example above, we demonstrated that aligning the marginal distributions and achieving a small source error is not sufficient to guarantee a small target error. But in this example, it is actually possible to find another feature transformation that jointly aligns both the marginal data distributions and the labeling functions. Specifically, let the feature transformation \(g(x) = \mathbb{I}_{x\leq 0}(x)(x+1) + \mathbb{I}_{x > 0}(x)(2-x)\). Then, it is straightforward to verify that the source and target domains perfectly align with each other after adaptation. Furthermore, we also have \(\varepsilon_T(h) = 0\) if \(\varepsilon_S(h) = 0\).
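This fix can also be verified numerically. Below is a small sketch, with labeling functions chosen to match the construction (so that \(h^*(x) = 1\) iff \(x\in(-1/2, 3/2)\) is perfect before adaptation): after applying \(g\), the single threshold hypothesis \(h(z) = \mathbb{I}[z > 1/2]\) achieves zero error on both domains.

```python
import random

def g(x):
    """The 'good' transformation: shift the source, reflect the target."""
    return x + 1 if x <= 0 else 2 - x

f_S = lambda x: 1 if x > -0.5 else 0  # labels on the source support (-1, 0)
f_T = lambda x: 1 if x < 1.5 else 0   # labels on the target support (1, 2)

random.seed(0)
xs_s = [random.uniform(-1.0, 0.0) for _ in range(10000)]
xs_t = [random.uniform(1.0, 2.0) for _ in range(10000)]

# After g, both domains map onto (0, 1) AND the induced labels agree:
# the label is 1 exactly when g(x) > 1/2, on both domains.
h = lambda z: 1 if z > 0.5 else 0
err_s = sum(h(g(x)) != f_S(x) for x in xs_s) / len(xs_s)
err_t = sum(h(g(x)) != f_T(x) for x in xs_t) / len(xs_t)
print(err_s, err_t)  # 0.0 0.0
```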
Consequently, it is natural to wonder whether it is always possible to find a feature transformation and a hypothesis that align the marginal data distributions and minimize the source error, such that the composition of the two also achieves a small target error. Quite surprisingly, we show that this is not always possible. In fact, finding a feature transformation that aligns the marginal distributions can provably increase the joint error on both domains. With such a transformation, minimizing the source error will only lead to an increase of the target error!
More formally, let \(\mathcal{D}_S^Y/\mathcal{D}_T^Y\) be the marginal label distributions of the source/target domains. For any feature transformation \(g: \mathcal{X}\to\mathcal{Z}\), let \(\mathcal{D}_S^Z/\mathcal{D}_T^Z\) be the feature distributions obtained by applying \(g(\cdot)\) to \(\mathcal{D}_S/\mathcal{D}_T\), respectively. Furthermore, define \(d_{\text{JS}}(\cdot, \cdot)\) to be the Jensen-Shannon distance between a pair of distributions. Then, for any hypothesis \(h: \mathcal{Z}\to\{0, 1\}\), if \(d_{\text{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) \geq d_{\text{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\), the following inequality holds: $$\varepsilon_S(h\circ g) + \varepsilon_T(h\circ g)\geq \frac{1}{2}\left(d_{\text{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) - d_{\text{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z)\right)^2.$$
Let’s parse the above lower bound step by step. The LHS corresponds to the joint error achievable by the composite function \(h\circ g\) on both the source and the target domains. The RHS contains the distance between the marginal label distributions and the distance between the feature distributions. Hence, when the marginal label distributions \(\mathcal{D}_S^Y/\mathcal{D}_T^Y\) differ between two domains, i.e., \(d_{\text{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y) > 0\), aligning the marginal data distributions by learning \(g(\cdot)\) will only increase the lower bound. In particular, for domain-invariant representations where \(d_{\text{JS}}(\mathcal{D}_S^Z, \mathcal{D}_T^Z) = 0\), the lower bound attains its maximum value of \(\frac{1}{2}d^2_{\text{JS}}(\mathcal{D}_S^Y, \mathcal{D}_T^Y)\). Since in domain adaptation we only have access to labeled data from the source domain, minimizing the source error will only lead to an increase of the target error. In a nutshell, this lower bound can be understood as an uncertainty principle: when the marginal label distributions differ across domains, one has to incur large error in either the source domain or the target domain when using domain-invariant representations.
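To get a feel for the numbers, here is a small sketch computing the lower bound for binary labels. We take the Jensen-Shannon distance to be the square root of the JS divergence (natural log here, an illustrative choice), and the label proportions below are hypothetical.

```python
import math

def js_divergence(p, q):
    """JS divergence between Bernoulli(p) and Bernoulli(q) (natural log)."""
    def kl(a, b):
        out = 0.0
        for pa, pb in ((a, b), (1.0 - a, 1.0 - b)):
            if pa > 0.0:
                out += pa * math.log(pa / pb)
        return out
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def d_js(p, q):
    """Jensen-Shannon distance = square root of the JS divergence."""
    return math.sqrt(js_divergence(p, q))

# Hypothetical marginal label distributions: 30% positives in the source
# vs. 70% in the target.
gap = d_js(0.3, 0.7)

# With a perfectly domain-invariant representation (feature JS distance 0),
# the sum of source and target errors is at least gap**2 / 2.
lower_bound = 0.5 * gap ** 2
print(round(lower_bound, 4))
```

Identical label distributions give a vacuous bound (\(d_{\text{JS}} = 0\)), while a larger label shift forces a larger joint error on any domain-invariant representation.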
One implication of our lower bound is that when two domains have different marginal label distributions, minimizing the source error while aligning the two domains can lead to an increased target error. To verify this, let us consider the task of digit classification on the MNIST, SVHN and USPS datasets. The label distributions of these three datasets are shown in Figure 5.
From Figure 5, it is clear to see that these three datasets have quite different label distributions. Now let’s use DANN (Ganin et al., 2015) to classify on the target domain by learning a domain invariant representation while training to minimize error on the source domain.
We plot four adaptation trajectories for DANN in Figure 6. Across the four adaptation tasks, we can observe the following pattern: the test domain accuracy rapidly grows within the first 10 iterations before gradually decreasing from its peak, despite consistently increasing source training accuracy. These phase transitions can be verified from the negative slopes of the least squares fit of the adaptation curves (dashed lines in Figure 6). The above experimental results are consistent with our theoretical findings: over-training on the source task can indeed hurt generalization to the target domain when the label distributions differ.
Note that the failure mode in the above counter-example is due to the increase of the distance between the labeling functions during adaptation. One interesting direction for future work is then to characterize what properties the feature transformation function should have in order to decrease the shift between labeling functions. Of course domain adaptation would not be possible without proper assumptions on the underlying source/target domains. It would be nice to establish some realistic assumptions under which we can develop effective adaptation algorithms that align both the marginal distributions and the labeling functions. Feel free to get in touch if you’d like to talk more!
DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.
We are proud to present the following papers at the 33rd Conference on Neural Information Processing Systems (NeurIPS) in Vancouver, Canada. Check back for an update with poster numbers and links once the camera-ready papers become available.
If you are attending NeurIPS 2019, please stop by to say hello and hear more about what we are doing!
Joint-task Self-supervised Learning for Temporal Correspondence
Xueting Li (uc merced) · Sifei Liu (NVIDIA) · Shalini De Mello (NVIDIA) · Xiaolong Wang (CMU) · Jan Kautz (NVIDIA) · Ming-Hsuan Yang (UC Merced / Google)
Deep Equilibrium Models
Shaojie Bai (Carnegie Mellon University) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI) · Vladlen Koltun (Intel Labs)
Volumetric Correspondence Networks for Optical Flow
Gengshan Yang (Carnegie Mellon University) · Deva Ramanan (Carnegie Mellon University)
Efficient Symmetric Norm Regression via Linear Sketching
Zhao Song (University of Washington) · Ruosong Wang (Carnegie Mellon University) · Lin Yang (Johns Hopkins University) · Hongyang Zhang (Carnegie Mellon University) · Peilin Zhong (Columbia University)
Envy-Free Classification
Maria-Florina Balcan (Carnegie Mellon University) · Travis Dick (Carnegie Mellon University) · Ritesh Noothigattu (Carnegie Mellon University) · Ariel D Procaccia (Carnegie Mellon University)
Twin Auxilary Classifiers GAN
Mingming Gong (University of Melbourne) · Yanwu Xu (University of Pittsburgh) · Chunyuan Li (Microsoft Research) · Kun Zhang (CMU) · Kayhan Batmanghelich (University of Pittsburgh)
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Stephan Rabanser (Amazon) · Stephan Günnemann (Technical University of Munich) · Zachary Lipton (Carnegie Mellon University)
Backprop with Approximate Activations for Memory-efficient Network Training
Ayan Chakrabarti (Washington University in St. Louis) · Benjamin Moseley (Carnegie Mellon University)
Total Least Squares Regression in Input Sparsity Time
Huaian Diao (Northeast Normal University) · Zhao Song (Harvard University & University of Washington) · David Woodruff (Carnegie Mellon University) · Xin Yang (University of Washington)
Conformal Prediction Under Covariate Shift
Rina Foygel Barber (University of Chicago) · Emmanuel Candes (Stanford University) · Aaditya Ramdas (CMU) · Ryan Tibshirani (Carnegie Mellon University)
Optimal Analysis of Subset-Selection Based L_p Low-Rank Approximation
Chen Dan (Carnegie Mellon University) · Hong Wang (Massachusetts Institute of Technology) · Hongyang Zhang (Carnegie Mellon University) · Yuchen Zhou (University of Wisconsin, Madison) · Pradeep Ravikumar (Carnegie Mellon University)
Third-Person Visual Imitation Learning via Decoupled Hierarchical Control
Pratyusha Sharma (Carnegie Mellon University) · Deepak Pathak (UC Berkeley) · Abhinav Gupta (Facebook AI Research/CMU)
Visual Sequence Learning in Hierarchical Prediction Networks and Primate Visual Cortex
Jielin Qiu (Shanghai Jiao Tong University) · Ge Huang (Carnegie Mellon University) · Tai Sing Lee (Carnegie Mellon University)
Optimal Decision Tree with Noisy Outcomes
Su Jia (CMU) · Viswanath Nagarajan (University of Michigan, Ann Arbor) · Fatemeh Navidi (University of Michigan) · R Ravi (CMU)
Learning Sample-Specific Models with Low-Rank Personalized Regression
Ben Lengerich (Carnegie Mellon University) · Bryon Aragam (University of Chicago) · Eric Xing (Petuum Inc. / Carnegie Mellon University)
A Normative Theory for Causal Inference and Bayes Factor Computation in Neural Circuits
Wenhao Zhang (Carnegie Mellon & U. of Pittsburgh) · Si Wu (Peking University) · Brent Doiron (University of Pittsburgh) · Tai Sing Lee (Carnegie Mellon University)
Regularized Weighted Low Rank Approximation
Frank Ban (UC Berkeley) · David Woodruff (Carnegie Mellon University) · Richard Zhang (UC Berkeley)
Partially Encrypted Deep Learning using Functional Encryption
Theo Ryffel (École Normale Supérieure) · David Pointcheval (École Normale Supérieure) · Francis Bach (INRIA – Ecole Normale Superieure) · Edouard Dufour-Sans (Carnegie Mellon University) · Romain Gay (UC Berkeley)
Learning low-dimensional state embeddings and metastable clusters from time series data
Yifan Sun (Carnegie Mellon University) · Yaqi Duan (Princeton University) · Hao Gong (Princeton University) · Mengdi Wang (Princeton University)
Offline Contextual Bayesian Optimization
Ian Char (Carnegie Mellon University) · Youngseog Chung (Carnegie Mellon University) · Willie Neiswanger (Carnegie Mellon University) · Kirthevasan Kandasamy (Carnegie Mellon University) · Oak Nelson (Princeton Plasma Physics Lab) · Mark Boyer (Princeton Plasma Physics Lab) · Egemen Kolemen (Princeton Plasma Physics Lab) · Jeff Schneider (Carnegie Mellon University)
Game Design for Eliciting Distinguishable Behavior
Fan Yang (Carnegie Mellon University) · Liu Leqi (Carnegie Mellon University) · Yifan Wu (Carnegie Mellon University) · Zachary Lipton (Carnegie Mellon University) · Pradeep Ravikumar (Carnegie Mellon University) · Tom M Mitchell (Carnegie Mellon University) · William Cohen (Google AI)
Optimal Sketching for Kronecker Product Regression and Low Rank Approximation
Huaian Diao (Northeast Normal University) · Rajesh Jayaram (Carnegie Mellon University) · Zhao Song (UT-Austin) · Wen Sun (Microsoft Research) · David Woodruff (Carnegie Mellon University)
Online Learning for Auxiliary Task Weighting for Reinforcement Learning
Xingyu Lin (Carnegie Mellon University) · Harjatin Baweja (CMU) · George Kantor (CMU) · David Held (CMU)
Cost Effective Active Search
Shali Jiang (Washington University in St. Louis) · Roman Garnett (Washington University in St. Louis) · Benjamin Moseley (Carnegie Mellon University)
Mutually Regressive Point Processes
Ifigeneia Apostolopoulou (Carnegie Mellon University) · Scott Linderman (Stanford University) · Kyle Miller (Carnegie Mellon University) · Artur Dubrawski (Carnegie Mellon University)
Efficient Regret Minimization Algorithm for Extensive-Form Correlated Equilibrium
Gabriele Farina (Carnegie Mellon University) · Chun Kai Ling (Carnegie Mellon University) · Fei Fang (Carnegie Mellon University) · Tuomas Sandholm (Carnegie Mellon University)
Optimistic Regret Minimization for Extensive-Form Games via Dilated Distance-Generating Functions
Gabriele Farina (Carnegie Mellon University) · Christian Kroer (Columbia University) · Tuomas Sandholm (Carnegie Mellon University)
Face Reconstruction from Voice using Generative Adversarial Networks
Yandong Wen (Carnegie Mellon University) · Bhiksha Raj (Carnegie Mellon University) · Rita Singh (Carnegie Mellon University)
On Testing for Biases in Peer Review
Ivan Stelmakh (Carnegie Mellon University) · Nihar Shah (CMU) · Aarti Singh (CMU)
Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels
Simon Du (Carnegie Mellon University) · Kangcheng Hou (Zhejiang University) · Ruslan Salakhutdinov (Carnegie Mellon University) · Barnabas Poczos (Carnegie Mellon University) · Ruosong Wang (Carnegie Mellon University) · Keyulu Xu (MIT)
Acceleration via Symplectic Discretization of High-Resolution Differential Equations
Bin Shi (UC Berkeley) · Simon Du (Carnegie Mellon University) · Weijie Su (University of Pennsylvania) · Michael Jordan (UC Berkeley)
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang (Tsinghua University) · Zihang Dai (Carnegie Mellon University) · Yiming Yang (CMU) · Jaime Carbonell (CMU) · Ruslan Salakhutdinov (Carnegie Mellon University) · Quoc V Le (Google)
Mixtape: Breaking the Softmax Bottleneck Efficiently
Zhilin Yang (Tsinghua University) · Thang Luong (Google) · Ruslan Salakhutdinov (Carnegie Mellon University) · Quoc V Le (Google)
MaCow: Masked Convolutional Generative Flow
Xuezhe Ma (Carnegie Mellon University) · Xiang Kong (Carnegie Mellon University) · Shanghang Zhang (Carnegie Mellon University) · Eduard Hovy (Carnegie Mellon University)
Adaptive Gradient-Based Meta-Learning Methods
Mikhail Khodak (CMU) · Maria-Florina Balcan (Carnegie Mellon University) · Ameet Talwalkar (CMU)
Towards a Zero-One Law for Column Subset Selection
Zhao Song (University of Washington) · David Woodruff (Carnegie Mellon University) · Peilin Zhong (Columbia University)
Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning
Jian Ni (University of Science and Technology of China) · Shanghang Zhang (Carnegie Mellon University) · Haiyong Xie (University of Science and Technology of China)
Likelihood-Free Overcomplete ICA and Applications in Causal Discovery
Chenwei Ding (The University of Sydney) · Mingming Gong (University of Melbourne) · Kun Zhang (CMU) · Dacheng Tao (University of Sydney)
The bias of the sample mean in multi-armed bandits can be positive or negative
Jaehyeok Shin (Carnegie Mellon University) · Aaditya Ramdas (Carnegie Mellon University) · Alessandro Rinaldo (CMU)
Efficient and Thrifty Voting by Any Means Necessary
Debmalya Mandal (Columbia University) · Ariel D Procaccia (Carnegie Mellon University) · Nisarg Shah (University of Toronto) · David Woodruff (Carnegie Mellon University)
Re-examination of the Role of Latent Variables in Sequence Modeling
Guokun Lai (Carnegie Mellon University) · Zihang Dai (Carnegie Mellon University)
Towards Understanding the Importance of Shortcut Connections in Residual Networks
Tianyi Liu (Georgia Institute of Technology) · Minshuo Chen (Georgia Tech) · Mo Zhou (Duke University) · Simon Du (Carnegie Mellon University) · Enlu Zhou (Georgia Institute of Technology) · Tuo Zhao (Georgia Tech)
Learning Local Search Heuristics for Boolean Satisfiability
Emre Yolcu (Carnegie Mellon University) · Barnabas Poczos (Carnegie Mellon University)
Difference Maximization Q-learning: Provably Efficient Q-learning with Function Approximation
Simon Du (Carnegie Mellon University) · Yuping Luo (Princeton University) · Ruosong Wang (Carnegie Mellon University) · Hanrui Zhang (Duke University)
On Exact Computation with an Infinitely Wide Neural Net
Sanjeev Arora (Princeton University) · Simon Du (Carnegie Mellon University) · Wei Hu (Princeton University) · zhiyuan li (Princeton University) · Ruslan Salakhutdinov (Carnegie Mellon University) · Ruosong Wang (Carnegie Mellon University)
Paradoxes in Fair Machine Learning
Paul Goelz (Carnegie Mellon University) · Anson Kahng (Carnegie Mellon University) · Ariel D Procaccia (Carnegie Mellon University)
Graph Agreement Models for Semi-Supervised Learning
Otilia Stretcu (Carnegie Mellon University) · Krishnamurthy Viswanathan (Google Research) · Dana Movshovitz-Attias (Google) · Emmanouil Platanios (Carnegie Mellon University) · Sujith Ravi (Google Research) · Andrew Tomkins (Google)
Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses
Ananya Uppal (Carnegie Mellon University) · Shashank Singh (Carnegie Mellon University) · Barnabas Poczos (Carnegie Mellon University)
Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks
Gabriele Farina (Carnegie Mellon University) · Chun Kai Ling (Carnegie Mellon University) · Fei Fang (Carnegie Mellon University) · Tuomas Sandholm (Carnegie Mellon University)
ADDIS: an adaptive discarding algorithm for online FDR control with conservative nulls
Jinjin Tian (Carnegie Mellon University) · Aaditya Ramdas (Carnegie Mellon University)
Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels
Michela Meister (Google) · Tamas Sarlos (Google Research) · David Woodruff (Carnegie Mellon University)
Differentiable Convex Optimization Layers
Akshay Agrawal (Stanford University) · Brandon Amos (Facebook) · Shane Barratt (Stanford University) · Stephen Boyd (Stanford University) · Steven Diamond (Stanford University) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss
Zhao Song (University of Washington) · David Woodruff (Carnegie Mellon University) · Peilin Zhong (Columbia University)
Efficient Forward Architecture Search
Hanzhang Hu (Carnegie Mellon University) · John Langford (Microsoft Research New York) · Rich Caruana (Microsoft) · Saurajit Mukherjee (Microsoft) · Eric J Horvitz (Microsoft Research) · Debadeepta Dey (Microsoft Research AI)
Efficient Near-Optimal Testing of Community Changes in Balanced Stochastic Block Models
Aditya Gangrade (Boston University) · Praveen Venkatesh (Carnegie Mellon University) · Bobak Nazer (Boston University) · Venkatesh Saligrama (Boston University)
Learning Robust Global Representations by Penalizing Local Predictive Power
Haohan Wang (Carnegie Mellon University) · Songwei Ge (Carnegie Mellon University) · Zachary Lipton (Carnegie Mellon University) · Eric Xing (Petuum Inc. / Carnegie Mellon University)
Unsupervised Curricula for Visual Meta-Reinforcement Learning
Allan Jabri (UC Berkeley) · Kyle Hsu (University of Toronto) · Ben Eysenbach (Carnegie Mellon University) · Abhishek Gupta (University of California, Berkeley) · Alexei Efros (UC Berkeley) · Sergey Levine (UC Berkeley) · Chelsea Finn (Stanford University)
Deep Gamblers: Learning to Abstain with Portfolio Theory
Ziyin Liu (University of Tokyo) · Zhikang Wang (University of Tokyo) · Paul Pu Liang (Carnegie Mellon University) · Ruslan Salakhutdinov (Carnegie Mellon University) · Louis-Philippe Morency (Carnegie Mellon University) · Masahito Ueda (University of Tokyo)
Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection
Xiaoyi Gu (Carnegie Mellon University) · Leman Akoglu (CMU) · Alessandro Rinaldo (CMU)
On the (in)fidelity and sensitivity of explanations
Chih-Kuan Yeh (Carnegie Mellon University) · Cheng-Yu Hsieh (National Taiwan University) · Arun Suggala (Carnegie Mellon University) · David Inouye (Carnegie Mellon University) · Pradeep Ravikumar (Carnegie Mellon University)
Learning Stable Deep Dynamics Models
J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI) · Gaurav Manek (Carnegie Mellon University)
Learning Neural Networks with Adaptive Regularization
Han Zhao (Carnegie Mellon University) · Yao-Hung Tsai (Carnegie Mellon University) · Ruslan Salakhutdinov (Carnegie Mellon University) · Geoffrey Gordon (MSR Montréal & CMU)
Uniform convergence may be unable to explain generalization in deep learning
Vaishnavh Nagarajan (Carnegie Mellon University) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Adversarial Music: Real world Audio Adversary against Wake-word Detection System
Juncheng Li (Carnegie Mellon University) · Shuhui Qu (Stanford University) · Xinjian Li (Carnegie Mellon University) · Joseph Szurley (Bosch Center for Artificial Intelligence) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI) · Florian Metze (Carnegie Mellon University)
Neuropathic Pain Diagnosis Simulator for Causal Discovery Algorithm Evaluation
Ruibo Tu (KTH Royal Institute of Technology) · Kun Zhang (CMU) · Bo Bertilson (KI Karolinska Institutet) · Hedvig Kjellstrom (KTH Royal Institute of Technology) · Cheng Zhang (Microsoft)
Triad Constraints for Learning Causal Structure of Latent Variables
Ruichu Cai (Guangdong University of Technology) · Feng Xie (Guangdong University of Technology) · Clark Glymour (Carnegie Mellon University) · Zhifeng Hao (Guangdong University of Technology) · Kun Zhang (CMU)
Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights
David Farrow (Carnegie Mellon University) · Maria Jahja (Carnegie Mellon University) · Roni Rosenfeld (Carnegie Mellon University) · Ryan Tibshirani (Carnegie Mellon University)
Specific and Shared Causal Relation Modeling and Mechanism-based Clustering
Biwei Huang (Carnegie Mellon University) · Kun Zhang (CMU) · Pengtao Xie (Petuum / CMU) · Mingming Gong (University of Melbourne) · Eric Xing (Petuum Inc.) · Clark Glymour (Carnegie Mellon University)
Towards modular and programmable architecture search
Renato Negrinho (Carnegie Mellon University) · Matthew Gormley (Carnegie Mellon University) · Geoffrey Gordon (MSR Montréal & CMU) · Darshan Patil (Carnegie Mellon University) · Nghia Le (Carnegie Mellon University) · Daniel Ferreira (TU Wien)
Are Sixteen Heads Really Better than One?
Paul Michel (Carnegie Mellon University, Language Technologies Institute) · Omer Levy (Facebook) · Graham Neubig (Carnegie Mellon University)
Inducing brain-relevant bias in natural language processing models
Dan Schwartz (Carnegie Mellon University) · Mariya Toneva (Carnegie Mellon University) · Leila Wehbe (Carnegie Mellon University)
Differentially Private Covariance Estimation
Kareem Amin (Google Research) · Travis Dick (Carnegie Mellon University) · Alex Kulesza (Google) · Andres Munoz (Google) · Sergei Vassilvitskii (Google)
Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption
Wei Ma (Carnegie Mellon University) · George Chen (Carnegie Mellon University)
Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)
Mariya Toneva (Carnegie Mellon University) · Leila Wehbe (Carnegie Mellon University)
On Human-Aligned Risk Minimization
Liu Leqi (Carnegie Mellon University) · Adarsh Prasad (Carnegie Mellon University) · Pradeep Ravikumar (Carnegie Mellon University)
Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
Ben Eysenbach (Carnegie Mellon University) · Ruslan Salakhutdinov (Carnegie Mellon University) · Sergey Levine (UC Berkeley)
Multiple Futures Prediction
Charlie Tang (Apple Inc.) · Ruslan Salakhutdinov (Carnegie Mellon University)
Neural Taskonomy: Inferring the Similarity of Task-Derived Representations from Brain Activity
Aria Y Wang (Carnegie Mellon University) · Leila Wehbe (Carnegie Mellon University) · Michael J Tarr (Carnegie Mellon University)
Inherent Tradeoffs in Learning Fair Representation
Han Zhao (Carnegie Mellon University) · Geoff Gordon (Microsoft)
Learning Data Manipulation for Augmentation and Weighting
Zhiting Hu (Carnegie Mellon University) · Bowen Tan (CMU) · Ruslan Salakhutdinov (Carnegie Mellon University) · Tom Mitchell (Carnegie Mellon University) · Eric Xing (Petuum Inc. / Carnegie Mellon University)
Figure 1: Comparison of existing algorithms without policy certificates (top) and with our proposed policy certificates (bottom). While in existing reinforcement learning the user has no information about how well the algorithm will perform in the next episode, we propose that algorithms output policy certificates before playing an episode to allow users to intervene if necessary.
Designing reinforcement learning methods that find a good policy with as few samples as possible is a key goal of both empirical and theoretical research. On the theoretical side, there are two main types of guarantees, regret bounds and PAC (probably approximately correct) bounds, for measuring and ensuring the sample-efficiency of a method. Ideally, we would like algorithms that perform well according to both criteria, as they measure different aspects of sample efficiency, and we have shown previously that one cannot simply convert one into the other. In a specific setting called tabular episodic MDPs, a recent algorithm achieved close to optimal regret bounds, but no method was known to be close to optimal according to the PAC criterion, despite a long line of research. In our work presented at ICML 2019, we close this gap with a new method that achieves minimax-optimal PAC (and regret) bounds, matching the statistical worst-case lower bounds in the dominating terms.
Interestingly, we achieve this by addressing a general shortcoming of PAC and regret bounds: they do not reveal when an algorithm will potentially take bad actions (only, for example, how often). This leads to a lack of accountability that could be particularly problematic in high-stakes applications (see the motivational scenario in Figure 2).
Besides being sample-efficient, our algorithm avoids this lack of accountability because it outputs what we call policy certificates. A policy certificate is a confidence interval, output by the algorithm before each episode, that contains both the expected return of the algorithm’s current policy and the optimal return (see Figure 1). This information allows users of our algorithm to intervene if the certified performance is not deemed adequate. We accompany this algorithm with a new type of learning guarantee called IPOC that is stronger than PAC, regret, and the recent Uniform-PAC guarantees, as it ensures not only sample-efficiency but also the tightness of policy certificates. We primarily consider the simple tabular episodic setting, where there are only a small number of possible states and actions. While this is often not the case in practical applications, we believe the insights developed in this work can be used to design more sample-efficient and accountable reinforcement learning methods for challenging real-world problems with rich observations such as images or text.
We propose to make methods for episodic reinforcement learning more accountable by having them output a policy certificate before each episode. A policy certificate is a confidence interval \([l_k, u_k]\) where \(k\) is the episode index. This interval contains both the expected sum of rewards of the algorithm’s policy in the next episode and the optimal expected sum of rewards in the next episode (see Figure 1 for an illustration). As such, a policy certificate helps answer two questions which are of interest in many applications:
Policy certificates are only useful if these confidence intervals are not too loose. To ensure this, we introduce a type of guarantee for algorithms with policy certificates: IPOC (Individual POlicy Certificates) bounds. These bounds guarantee that all certificates are valid confidence intervals and bound the number of times their length can exceed any given threshold. IPOC bounds therefore guarantee both the sample-efficiency of policy learning and the accuracy of policy certificates: the algorithm has to play better and better policies, but also needs to tell us more and more accurately how good these policies are. IPOC bounds are stronger than existing learning bounds such as PAC or regret (see Figure 3) and imply that the algorithm is anytime interruptible (see the paper for details).
Policy certificates are not limited to specific types of algorithms, but optimistic algorithms are particularly natural to extend to output them. These methods give us the upper end of certificates “for free”: they maintain an upper confidence bound \(\tilde Q(s,a)\) on the optimal value function \(Q^\star(s,a)\) and follow the greedy policy \(\pi\) with respect to this upper confidence bound. In a similar fashion, we can compute a lower confidence bound \(\underset{\sim}{Q}(s,a)\) on the Q-function \(Q^\pi(s,a)\) of this greedy policy. The certificate for this policy is then just these confidence bounds evaluated at the initial state \(s_1\) of the episode: \([l_k, u_k] = \left[ \underset{\sim}{Q}(s_1, \pi(s_1)), \tilde Q(s_1, \pi(s_1))\right]\).
We demonstrate this principle with a new algorithm called ORLC (Optimistic RL with Certificates) for tabular MDPs. Similar to existing optimistic algorithms like UCBVI and UBEV, it computes the confidence bounds \(\tilde Q\) by optimistic value iteration on an estimated model but also computes lower confidence bounds \(\underset{\sim}{Q}\) with a pessimistic version of value iteration. These procedures are similar to vanilla value iteration but add optimism bonuses or subtract pessimism bonuses in each time step respectively to ensure high confidence bounds.
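To make this concrete, here is a minimal sketch of such certified optimistic/pessimistic value iteration on an estimated tabular model. This is not the authors' implementation: the bonuses below are simplified Hoeffding-style terms rather than the tighter ORLC bonuses from the paper, and the function name and inputs (`P_hat`, `r_hat`, visit counts `n`) are my own.

```python
import numpy as np

def certified_value_iteration(P_hat, r_hat, n, H, delta=0.05):
    """Optimistic and pessimistic value iteration on an estimated model,
    returning a greedy policy and a certificate (l, u) at the initial state.
    P_hat: (S, A, S) empirical transition probabilities
    r_hat: (S, A) empirical mean rewards (assumed in [0, 1])
    n:     (S, A) visit counts
    NOTE: simplified Hoeffding-style bonuses, not the exact ORLC bonuses."""
    S, A = r_hat.shape
    bonus = H * np.sqrt(np.log(2.0 / delta) / np.maximum(n, 1.0))
    V_up = np.zeros((H + 1, S))   # upper bounds on optimal values
    V_lo = np.zeros((H + 1, S))   # lower bounds on the greedy policy's values
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # optimistic backup: add bonus, act greedily w.r.t. the upper bound
        Q_up = np.clip(r_hat + bonus + P_hat @ V_up[h + 1], 0.0, H - h)
        pi[h] = Q_up.argmax(axis=1)
        V_up[h] = Q_up.max(axis=1)
        # pessimistic backup: subtract bonus, evaluate the SAME greedy policy
        Q_lo = np.clip(r_hat - bonus + P_hat @ V_lo[h + 1], 0.0, H - h)
        V_lo[h] = Q_lo[np.arange(S), pi[h]]
    return pi, (V_lo[0, 0], V_up[0, 0])  # certificate for initial state s=0
```

The returned pair is exactly the certificate \([l_k, u_k]\) described above, evaluated at the initial state.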
Interestingly, we found that computing lower confidence bounds for policy certificates can also improve sample-efficiency of policy learning. More concretely, we could tighten the optimism bonuses in our tabular method ORLC using the lower bounds \(\underset{\sim}{Q}\). This makes the algorithm less conservative and able to adjust more quickly to observed data. As a result, we were able to prove the first PAC bounds for tabular MDPs that are minimax-optimal in the dominating term:
Theorem: Minimax IPOC Mistake, PAC and regret bound of ORLC
In any episodic MDP with S states, A actions and an episode length H, the algorithm ORLC satisfies the IPOC Mistake bound below. That is, with probability at least \(1-\delta\), all certificates are valid confidence intervals and for all \(\epsilon > 0\) ORLC outputs certificates larger than \(\epsilon\) in at most
$$\tilde O\left( \frac{S A H^2}{\epsilon^2}\ln \frac 1 \delta + \frac{S^2 A H^3}{\epsilon}\ln \frac 1 \delta \right)$$
episodes. This immediately implies that the bound above is a (Uniform-)PAC bound and that ORLC satisfies, for any number of episodes \(T\), a high-probability regret bound of
$$\tilde O\left( \sqrt{SAH^2 T} \ln 1/\delta + S^2 A H^3 \ln(T / \delta) \right)$$.
Comparing the order of our PAC bounds against the statistical lower bounds and the prior state-of-the-art PAC and regret bounds in the table below, this is the first time the optimal polynomial dependency \(SAH^2\) has been achieved in the dominating \(\epsilon^{-2}\) term. Our bounds also improve on the prior regret bounds of UCBVI-BF by avoiding their \(\sqrt{H^3T}\) term, making our bounds minimax-optimal even when the episode length \(H\) is large.
Algorithm | (mistake) PAC bound | Regret bound | IPOC Mistake bound |
--- | --- | --- | --- |
Lower bounds | \( \frac{SAH^2}{\epsilon^2} \) | \( \sqrt{H^2 S A T}\) | \( \frac{SAH^2}{\epsilon^2} \) |
ORLC (ours) | \( \frac{SAH^2}{\epsilon^2} + \frac{S^2 A H^3}{\epsilon} \) | \( \sqrt{H^2 S A T} + S^2 AH^3 \) | \( \frac{SAH^2}{\epsilon^2} + \frac{S^2 A H^3}{\epsilon} \) |
UCBVI | – | \( \sqrt{H^2 S A T} + \sqrt{H^3 T} + S^2 AH^2 \) | – |
UBEV | \( \frac{SAH^3}{\epsilon^2} + \frac{S^2 A H^3}{\epsilon} \) | \( \sqrt{H^3 S A T} + S^2 AH^3 \) | – |
UCFH | \( \frac{S^2AH^2}{\epsilon^2} \) | – | – |
As mentioned above, our algorithm achieves this new IPOC guarantee and improved PAC bounds by maintaining, at all times, a lower confidence bound \(\underset{\sim}{Q}(s,a)\) on the Q-function \(Q^\pi(s,a)\) of its policy, in addition to the usual upper confidence bound \(\tilde Q(s,a)\) on the optimal value function \(Q^\star(s,a)\). Deriving tight lower confidence bounds \(\underset{\sim}{Q}(s,a)\) requires new techniques compared to those for upper confidence bounds. All recent optimistic algorithms for tabular MDPs exploit, for their upper confidence bounds, the fact that \(\tilde Q\) bounds \(Q^\star\), which does not depend on the samples: the optimal Q-function is always the same, no matter what samples the algorithm saw. We cannot leverage the same insight for our lower confidence bounds, because the Q-function \(Q^\pi\) of the current policy does depend on the samples the algorithm saw; after all, the policy \(\pi\) is computed as a function of these samples. We develop a technique that deals with this challenge by explicitly incorporating both upper and lower confidence bounds in our bonus terms. It turns out that this technique yields not only tighter lower confidence bounds but also tighter upper confidence bounds, which is the key to our improved PAC and regret bounds.
Our work provided the final ingredient for PAC bounds for episodic tabular MDPs that are minimax-optimal up to lower-order terms and also established the foundation for policy certificates. In the full paper, we also considered more general MDPs and designed a policy certificate algorithm for so-called finite MDPs with linear side information. This is a generalization of the popular linear contextual bandit setting and requires function approximation. In the future, we plan to investigate policy certificates as a useful empirical tool for deep reinforcement learning techniques and examine whether the specific form of optimism bonuses derived in this work can inspire more sample-efficient exploration bonuses in deep RL methods.
This post is also featured on the Stanford AIforHI blog and is based on work in the following paper:
Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
Policy Certificates: Towards Accountable Reinforcement Learning
International Conference on Machine Learning (ICML) 2019
DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.
Automated decision-making is one of the core objectives of artificial intelligence. Not surprisingly, over the past few years, entire new research fields have emerged to tackle this task. This blog post is concerned with regret minimization, one of the central tools in online learning. Regret minimization models the problem of repeated online decision making: an agent is called to make a sequence of decisions under unknown (and potentially adversarial) loss functions. Regret minimization is a versatile mathematical abstraction that has found a plethora of practical applications: portfolio optimization, computation of Nash equilibria, applications to markets and auctions, submodular function optimization, and more.
In this blog post, we will be interested in showing how one can compose regret-minimizing agents, or regret minimizers for short. In other words, suppose that you are given a regret minimizer that can output good decisions on a set \(\mathcal{X}\) and another regret minimizer that can output good decisions on a set \(\mathcal{Y}\). We show how you can combine them to build a good regret minimizer for a composite set obtained from \(\mathcal{X}\) and \(\mathcal{Y}\): for example, their Cartesian product, their convex hull, or their intersection. Our approach treats the two regret minimizers, one for \(\mathcal{X}\) and one for \(\mathcal{Y}\), as black boxes. This is tricky: we combine them without ever opening the boxes, so we must account for the possibility of combining very different regret minimizers. On the other hand, the benefit is that we are free to pick the best regret minimizer for each individual set. This is important: in an extensive-form game, for example, we may know how to build specialized regret minimizers for different parts of the game, and we can then combine them into a composite regret minimizer that handles the whole game. All material is based on a recent paper that appeared at ICML 2019.
Toward the end of the blog post, I will give several applications of this calculus. It enables one to do several things that were not possible before. It also gives a significantly simpler proof of counterfactual regret minimization (CFR), the state-of-the-art scalable method for computing Nash equilibria in large extensive-form games. The whole exact CFR algorithm falls out naturally, almost trivially, from our calculus.
A regret minimizer is an abstraction of a repeated decision-maker. One way to think about a regret minimizer is as a device that supports two operations:
The decision making is online, in the sense that each decision is made by only taking into account the past decisions and their corresponding loss functions; no information about future losses is available to the regret minimizer at any time. For the rest of the post, we focus on linear losses, that is \(\mathcal{F} = \mathcal{L}\) where \(\mathcal{L}\) denotes the set of all linear functions with domain \(\mathcal{X}\).
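As a concrete instance of this abstraction, here is a sketch of a classic regret minimizer for the probability simplex, regret matching (discussed further below). The method names `next_decision` and `observe_loss` are my own hypothetical labels for the two operations of the abstraction.

```python
import numpy as np

class RegretMatching:
    """Sketch of regret matching for the probability simplex.
    next_decision() outputs the next (mixed) decision; observe_loss()
    receives the linear loss vector for the round just played."""
    def __init__(self, n_actions):
        self.regret_sum = np.zeros(n_actions)

    def next_decision(self):
        pos = np.maximum(self.regret_sum, 0.0)
        if pos.sum() == 0.0:
            return np.full(len(pos), 1.0 / len(pos))  # uniform if no positive regret
        return pos / pos.sum()  # play proportionally to positive regrets

    def observe_loss(self, loss):
        # instantaneous regret of each pure action vs. the played decision:
        # ell^t(x^t) - ell^t(e_a)
        x = self.next_decision()
        self.regret_sum += (x @ loss) - loss
```

Feeding it a fixed loss vector quickly concentrates its decisions on the lowest-loss action.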
The quality metric for a regret minimizer is its cumulative regret. Intuitively, it measures how well the regret minimizer did against the best fixed decision in hindsight. We can formalize this idea mathematically as the difference between the loss that was accumulated, \(\sum_{t=1}^T \ell^t(\mathbf{x}^t)\), and the minimum possible cumulative loss, \(\min_{\hat{\mathbf{x}}\in\mathcal{X}} \sum_{t=1}^T \ell^t(\hat{\mathbf{x}})\). In formulas, the cumulative regret up to time \(T\) is defined as $$\displaystyle R^T := \sum_{t=1}^T \ell^t(\mathbf{x}^t) - \min_{\hat{\mathbf{x}} \in \mathcal{X}} \sum_{t=1}^T \ell^t(\hat{\mathbf{x}}).$$
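As a quick sanity check of the definition, the following sketch computes \(R^T\) for a sequence of decisions and linear losses. For the hindsight minimum it assumes \(\mathcal{X}\) is the probability simplex, where the best fixed decision is a vertex; this assumption is mine, for illustration only.

```python
import numpy as np

def cumulative_regret(decisions, losses):
    """Cumulative external regret with linear losses: decisions[t] is the
    point x^t, losses[t] the loss vector, so ell^t(x) = <losses[t], x>.
    Assumes the comparator set X is the probability simplex."""
    decisions, losses = np.asarray(decisions), np.asarray(losses)
    incurred = np.einsum('ti,ti->t', losses, decisions).sum()
    best_fixed = losses.sum(axis=0).min()  # best simplex vertex in hindsight
    return incurred - best_fixed
```

For example, playing the uniform mixture against the loss vector \((1, 0)\) twice accumulates loss 1 while the best fixed action accumulates 0, giving \(R^2 = 1\).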
“Good” regret minimizers, also called Hannan consistent regret minimizers, are such that their cumulative regret grows sublinearly as a function of \(T\). Several good and general-purpose regret minimizers are known in the literature. Some of them, like follow-the-regularized-leader, online mirror descent, and online (projected) gradient descent, work for any convex domain \(\mathcal{X}\). Others, such as regret matching and regret matching plus, are tailored to specific domains; both of these are specifically designed for the case in which \(\mathcal{X}\) is a (probability) simplex. However, these general-purpose regret minimizers typically come with two drawbacks:
Given the drawbacks of the traditional approaches, we started to wonder about different ways to construct regret minimizers, until we stumbled upon an intriguing thought: can we construct regret minimizers for composite sets by combining regret minimizers for the individual atoms? The answer is yes.
Let’s start from a simple example. Suppose we have a regret minimizer that outputs decisions on a convex set \(\mathcal{X}\), and another regret minimizer that outputs decisions on a convex set \(\mathcal{Y}\). How can we combine them to obtain a regret minimizer for their Cartesian product \(\mathcal{X} \times \mathcal{Y}\)? The natural idea, in this case, is to let the two regret minimizers operate independently:
This process is represented pictorially in Figure 2. We coin this type of pictorial representation a “regret circuit”.
Some simple algebra shows that, at all times \(T\), our strategy guarantees that the cumulative regret \(R^T\) of the composite regret minimizer (as seen from outside of the gray dashed box) satisfies \(R^T = R_\mathcal{X}^T + R_\mathcal{Y}^T\), where \(R_\mathcal{X}^T\) and \(R_\mathcal{Y}^T\) are the cumulative regrets of the regret minimizers for the domains \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. Hence, if both of those regret minimizers are “good” (Hannan consistent), then so is the composite regret minimizer.
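In code, the Cartesian-product circuit is just a thin wrapper around the two black boxes. This is a sketch under my own assumed two-operation interface (`next_decision`/`observe_loss`); a linear loss on \(\mathcal{X} \times \mathcal{Y}\) splits into its \(\mathcal{X}\)- and \(\mathcal{Y}\)-components, which is why the two losses can be routed independently.

```python
class CartesianProduct:
    """Regret circuit for X x Y: run the two regret minimizers
    independently. The composite regret is then R_X^T + R_Y^T.
    rm_x and rm_y are black boxes exposing next_decision/observe_loss."""
    def __init__(self, rm_x, rm_y):
        self.rm_x, self.rm_y = rm_x, rm_y

    def next_decision(self):
        # the composite decision is simply the pair of independent decisions
        return (self.rm_x.next_decision(), self.rm_y.next_decision())

    def observe_loss(self, loss_x, loss_y):
        # a linear loss on X x Y decomposes into its X- and Y-components
        self.rm_x.observe_loss(loss_x)
        self.rm_y.observe_loss(loss_y)
```

Because the wrapper never inspects its components, any Hannan consistent regret minimizers can be plugged in.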
What about convex hulls? It turns out that this case is much trickier! We can try to reuse the same approach as before: we ask the two regret minimizers, one for \(\mathcal{X}\) and one for \(\mathcal{Y}\), to independently output decisions. But now we face a dilemma: how should we form a convex combination of the two decisions?
In this case, the regret circuit is shown in Figure 3.
If the loss function \(\ell_{\lambda}^{t-1}\) that enters the extra regret minimizer is set up correctly, and if all three internal regret minimizers are good, one can prove that the composite regret minimizer, as seen from outside of the gray dashed box, is also a good regret minimizer. In particular, a natural way to define \(\ell_{\lambda}^{t}\) is as
\[
\ell^t_\lambda : \Delta^{2} \ni (\lambda_1,\lambda_2) \mapsto \lambda_1 \ell^t(\mathbf{x}^t) + \lambda_2\ell^t(\mathbf{y}^t),
\]
which can be seen as a form of counterfactual loss function.
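Putting the pieces together, here is a sketch of the convex-hull circuit, again under my own assumed two-operation interface. The loss fed to the \(\lambda\)-minimizer is exactly the counterfactual loss \((\ell^t(\mathbf{x}^t), \ell^t(\mathbf{y}^t))\) defined above.

```python
import numpy as np

class ConvexHull:
    """Regret circuit for the convex hull of X and Y: rm_x and rm_y propose
    points, and a third regret minimizer rm_lam over the 2-simplex learns
    how to mix them. All three are black boxes exposing
    next_decision/observe_loss with linear (vector) losses."""
    def __init__(self, rm_x, rm_y, rm_lam):
        self.rm_x, self.rm_y, self.rm_lam = rm_x, rm_y, rm_lam
        self._x = self._y = None  # last proposed points, needed for the lambda loss

    def next_decision(self):
        self._x = np.asarray(self.rm_x.next_decision())
        self._y = np.asarray(self.rm_y.next_decision())
        lam = self.rm_lam.next_decision()
        return lam[0] * self._x + lam[1] * self._y

    def observe_loss(self, loss_vec):
        # both inner minimizers see the raw linear loss ...
        self.rm_x.observe_loss(loss_vec)
        self.rm_y.observe_loss(loss_vec)
        # ... while the lambda-minimizer sees the counterfactual loss:
        # the value of ell^t at each proposed point
        self.rm_lam.observe_loss(
            np.array([loss_vec @ self._x, loss_vec @ self._y]))
```

Note that `next_decision` must be called before `observe_loss`, since the counterfactual loss is evaluated at the points proposed in the current round.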
It turns out that the two regret circuits we’ve seen so far (one for the Cartesian product and one for the convex hull of two sets) are already enough to give a very natural proof of the counterfactual regret minimization (CFR) framework, a family of regret minimizers specifically tailored to extensive-form games. CFR has been the de facto state of the art for the past 10+ years for computing approximate Nash equilibria in large games, and has been one of the key technologies that enabled solving large Heads-Up Limit and No-Limit Texas Hold’Em poker. The basic intuition is as follows (all details are in our paper). Consider, for example, the sequential action space of the first player in the game of Kuhn poker (Figure 4, left):
In other words, we can represent the strategy of the player by composing convex hulls and Cartesian products, following the structure of the game (Figure 4, right).
Since we can express the set of strategies in the game by composing convex hulls and Cartesian products, it should now be clear how our framework assists us in constructing a regret minimizer for this domain.
After having seen Cartesian products and convex hulls, a natural question is: what about intersections and constraint satisfaction? In this case, we assume to have access to a good regret minimizer for a domain \(\mathcal{X}\), and we want to somehow construct a good regret minimizer for the curtailed set \(\mathcal{X} \cap \mathcal{Y}\).
It turns out that, in general, these constraining operations are more costly than “enlarging” operations such as convex hull, Minkowski sums, Cartesian products, etc. In the paper, we show two different circuits:
The main idea for both circuits is to use the regret minimizer for \(\mathcal{X}\) to output decisions, and then penalize infeasible choices by injecting extra penalization terms into the loss functions that enter the regret minimizer for \(\mathcal{X}\). In the case of the circuit that guarantees feasibility, the decisions are also projected onto \(\mathcal{X} \cap \mathcal{Y}\) before they are output by the composite regret minimizer. Figure 5 shows the resulting regret circuits, where \(d\) is the distance-generating function used in the projection (for example, a good choice could be \(d(\mathbf{x}) = \|\mathbf{x}\|^2_2\)), and \(\alpha^t\) is a penalization coefficient.
While we are not interested here in all the details of this circuit, we point out an interesting fact: the regret circuit is a constructive proof that we can always turn an infeasible regret minimizer into a feasible one by projecting onto the feasible set, outside the loop!
Armed with these new intersection circuits, we can show that the recent Constrained CFR algorithm arises as a special case of our framework. Our exact (feasible) intersection construction also leads to a new algorithm for the same problem.
Another application is in the realm of optimistic/predictive regret minimization. This is a recent subfield of online learning, whose techniques can be used to break the learning-theoretic barrier \(O(T^{-1/2})\) on the convergence rate of regret-based approaches to saddle points (for example, Nash equilibria). In a different ICML 2019 paper, we used our calculus to prove that, under certain hypotheses, CFR can be modified to have a convergence rate of \(O(T^{-3/4})\) to Nash equilibrium, instead of \(O(T^{-1/2})\) as in the original (non-optimistic) version.
Regret circuits have already proved to be useful in several applications, mostly in game theory. The fact that we can combine potentially very different regret minimizers as black boxes is very appealing because it enables us to choose the best algorithm for each set being composed, and to conquer different parts of the design space with different techniques. In the paper, we show regret circuits for several convexity-preserving operations, including convex hull, Cartesian product, affine transformations, intersections, and Minkowski sums. However, several questions remain open:
Figure 1: Visualization of supervised neighborhoods for local explanation with MAPLE. When seeing the new point \(X = (1, 0, 1)\), this tree determines that \(X_2\) and \(X_6\) are its neighbors and gives them weight 1 and gives all other points weight 0. MAPLE averages these weights across all the trees in the ensemble.
Machine learning is increasingly used to make critical decisions such as a doctor’s diagnosis, a biologist’s experimental design, and a lender’s loan decision. In these areas, mistakes can be the difference between life and death, can lead to wasted time and money, and can have serious legal consequences.
Because of the serious potential ramifications of using machine learning in these domains, it falls onto machine learning practitioners to ensure that their models are robust and to foster trust with the people who interact with their models. Broadly speaking, meeting these two goals is the objective of interpretability and is achieved by iteratively: explaining both global and local behavior of a model (increasing understanding), checking that these explanations make sense (developing trust), and fixing any identified problems (preventing bad failures).
Meeting these two goals is a very difficult task and interpretability faces many challenges, but we will be focusing on two in particular:
Our proposed method, MAPLE, couples classical local linear modeling techniques with a dual interpretation of tree ensembles (which aggregate the predictions of multiple decision trees), both as a supervised neighborhood approach and as a feature selection method (see Fig. 1). By doing this, we are able to slightly improve accuracy while producing multiple types of explanations.
Before diving into the technical details of MAPLE and how it works as an interpretability system (both for explaining its own predictions and for explaining the predictions of another model), we provide an overview and comparison of the main types of explanations.
At a high level, there are three main types of explanations:
Example-based explanations are clearly distinct from the other two explanation types, as the former relies on sample data points and the latter two on features. Furthermore, local and global explanations themselves capture fundamentally different characteristics of the predictive model. To see this, consider the toy datasets in Fig. 2 generated from three univariate functions.
Figure 2: Toy datasets from left to right (a) Linear (b) Shifted Logistic (c) Step Function.
Generally, local explanations are better suited for modeling smooth continuous effects (Fig. 2a). For discontinuous effects (Fig. 2c) or effects that are very strong in a small region (Fig. 2b), local explanations either fail to detect the effect or make unusual predictions, depending on how the local neighborhood is defined (i.e., whether or not it is defined in a supervised manner, more on this in the ‘Supervised vs Unsupervised Neighborhood’ section). We will call such effects global patterns because they are difficult to detect or model with local explanations.
Conversely, global explanations are less effective at explaining continuous effects and more effective at explaining global patterns. This is because they tend to be rule-based models that use feature discretization or binning. This processing doesn’t lend itself easily to modeling continuous effects (you need many small steps to approximate a linear model well) but does lend itself towards modeling the abrupt changes around global patterns (because those effects create natural cut-offs for the feature discretization or binning).
Most real datasets have both continuous and discontinuous effects and, therefore, it is crucial to devise explanation systems that can capture, or are at least aware of, both types of effects.
Because local explanations are actionable (they answer the question “what could I have done differently to get the desired outcome?”) and relevant to the people impacted by machine learning systems (it is not particularly helpful to a person to know how the model behaves for an entirely different person), we focus on them in this work.
The goal of a local explanation, \(g\), is to approximate our learned model, \(f\), well across some neighborhood of the input space, \(N_x\). Naturally, this leads to the fidelity-metric: \(E_{x’ \sim N_x}[ (g(x’) – f(x’))^2]\). The choices of \(g\) and \(N_x\) are important and should often be problem specific. Similar to previous work, we assume that \(g\) is a linear function.
Figure 3: A simple way of generating a local explanation that is very similar to LIME. From left to right, 1) Start with a point that you want to explain, 2) Define a neighborhood around that point, 3) Sample points from that neighborhood, and 4) Fit a linear model to the model’s predictions at those sampled points
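The four-step procedure in Figure 3 can be sketched in a few lines. This is a simplified LIME-like surrogate under my own assumptions (a Gaussian neighborhood, plain least squares); real LIME additionally distance-weights the samples and regularizes the fit.

```python
import numpy as np

def lime_like_explanation(f, x, sigma=0.1, n_samples=500, seed=0):
    """Sketch of the simple local-explanation procedure in Figure 3:
    1) start from the point x to explain, 2) define a Gaussian neighborhood
    N_x around it, 3) sample points from N_x and query the model f,
    4) fit a linear surrogate to the model's predictions by least squares.
    Returns the surrogate coefficients and the fidelity metric."""
    rng = np.random.default_rng(seed)
    pts = x + sigma * rng.standard_normal((n_samples, len(x)))
    preds = np.array([f(p) for p in pts])
    Xb = np.hstack([np.ones((n_samples, 1)), pts])  # intercept column
    beta, *_ = np.linalg.lstsq(Xb, preds, rcond=None)
    fidelity = np.mean((Xb @ beta - preds) ** 2)    # the fidelity metric above
    return beta, fidelity
```

On a model that is exactly linear near \(x\), the surrogate recovers the model's coefficients and the fidelity metric is essentially zero.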
MAPLE (Plumb et al. 2018) modifies tree ensembles to produce local explanations that are able to detect global patterns and to produce example-based explanations; these modifications are built on work from (A. Bloniarz et al. 2016) and (S. Kazemitabar et al. 2017). Importantly, we find that doing this typically improves the predictive accuracy of the model and that the resulting local explanations have high fidelity.
At a high level, MAPLE uses the tree ensemble to identify which training points are most relevant to a new prediction and uses those points to fit a linear model that is used both to make a prediction and as a local explanation. We will now make this more precise.
Given training data \((x_i, y_i)\) for \(i = 1, \ldots, n\), we start by training an ensemble of trees on this data, \(T_k\) for \(k = 1, \ldots, K\). For a point \(x\), let \(T_k(x)\) be the index of the leaf node of \(T_k\) that contains \(x\). Suppose that we want to make a prediction at \(x\) and also give an explanation for that prediction.
To do this, we start by assigning a similarity weight to each training point, \(x_i\), based on how often the trees put \(x_i\) and \(x\) in the same leaf node. So we define \(w_i = \frac{1}{K} \sum_{j=1}^K \mathbb{I}[T_j(x_i) = T_j(x)]\). This is how MAPLE produces example-based explanations; training points with a larger \(w_i\) will be more relevant to the prediction/explanation at \(x\) than training points with smaller weights. An example of this process for a single tree is shown in Fig. 1.
To actually make a prediction/explanation, we solve the weighted linear regression problem \(\hat\beta_x = \text{argmin}_\beta \sum_{i=1}^n w_i (\beta^T x_i – y_i)^2\). Then MAPLE makes the prediction \(f_{MAPLE}(x) = \hat\beta_x^T x\) and gives the local explanation \(\hat\beta_x\). Because the \(w_i\) depend on the training data (i.e., the most relevant points depend on the labels \(y_i\)), we say that \(\hat\beta_x\) uses a supervised neighborhood.
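A minimal sketch of this prediction/explanation step, using scikit-learn's `RandomForestRegressor` and its `apply` method to read off leaf memberships. This is an illustration rather than the authors' implementation; for instance, it omits MAPLE's feature-selection step, and the function name and defaults are my own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def maple_explain(X_train, y_train, x, n_trees=100, seed=0):
    """Sketch of MAPLE's prediction/explanation at a point x.
    Returns (prediction, local linear coefficients beta, neighborhood
    weights w). Not the authors' implementation."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X_train, y_train)
    # leaf indices: shape (n_samples, n_trees)
    train_leaves = forest.apply(X_train)
    x_leaves = forest.apply(x.reshape(1, -1))[0]
    # w_i = fraction of trees placing x_i in the same leaf as x
    w = (train_leaves == x_leaves).mean(axis=1)
    # weighted least squares with an intercept column
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xb * sw[:, None], y_train * sw, rcond=None)
    prediction = beta @ np.concatenate([[1.0], x])
    return prediction, beta, w
```

The weights `w` are the example-based explanation, while `beta` is both the local explanation and the basis for the prediction \(f_{MAPLE}(x) = \hat\beta_x^T x\).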
When LIME defines its local explanations, it optimizes for the fidelity-metric with \(N_x\) set as a probability distribution centered on \(x\). So we say it uses an unsupervised neighborhood. As mentioned earlier, the behavior of local explanations around global patterns depends on whether or not they use a supervised or unsupervised neighborhood.
Why don’t unsupervised neighborhoods detect global patterns? Near a global pattern, an unsupervised neighborhood will sample points on either side of it. Consequently, if the explanation is linear, it will smooth the global pattern (i.e., fail to detect it). Importantly, the only indication that something might be awry is that the explanation will have lower fidelity.
Although sometimes this smoothing is a good enough approximation, it would be better if the explanation detected the global pattern. For example, if we interpret Fig. 2b as the probability of giving someone a loan as their income increases, we can see that smoothing the global effect causes the explanation to give overly optimistic advice.
How are supervised neighborhoods different? On the other hand, supervised neighborhoods will tend to sample points only on one side of the global pattern and consequently will not smooth it. For example, in Fig. 2c, MAPLE will predict a slope of zero at almost all points because the function is flat across each one of its three learned neighborhoods.
But this clearly is also not a desirable behavior since it would imply that this feature does not matter for the prediction. Consequently, we introduce a technique to determine if a coefficient is zero/small because it does not matter or if it is zero/small because it is near a global pattern.
We do this by examining the probability distribution over the features induced by the weights, \(w_i\), and training points, \(x_i\), and determining where the explanation can be applied. Note that this distribution is defined using the weights learned by MAPLE. When a point is near a global pattern, this distribution becomes skewed and we can detect it. A brief example is shown below in Fig. 4 (see the paper for complete details).
Figure 4: An example of the local neighborhoods learned by MAPLE as we perform a grid search across the active feature of each of the toy datasets from Fig. 2. Notice that we can detect the strong effect by the small neighborhood in the steep region of the logistic curve (middle) and the discontinuities in the step function (right).
In summary, by using the local training distribution that MAPLE learns around a point, we can determine whether or not that point is near a global pattern.
When evaluating the effectiveness of MAPLE, there are three main questions:
We investigated these questions on several UCI datasets [Dheeru 2017] and summarize our results here (for full details, see the paper).
1. Do we sacrifice accuracy to gain interpretability? No, in fact MAPLE is almost always more accurate than the tree ensemble it is built on.
2. How well do its local explanations explain its own predictions? When comparing MAPLE’s local explanation to an explanation fit by LIME to explain the predictions made by MAPLE, MAPLE produces substantially better explanations (as measured by the fidelity metric).
This is not surprising since this is asking MAPLE to explain itself, but it does indicate that MAPLE is an improvement on tree ensembles in terms of both accuracy and interpretability.
3. How well can it explain a black-box model? When we use MAPLE or LIME to explain a black-box model (in this case a Support Vector Regression model), MAPLE often produces better explanations (again, measured by the fidelity metric).
By using leaf node membership as a form of supervised neighborhood selection, MAPLE is able to modify tree ensembles to be substantially more interpretable without the typical accuracy-interpretability trade-off. Additionally, it is able to provide feedback for all three types of explanations: local explanations via training a linear model, example-based explanations via highly weighted neighbors, and finally, detection of global patterns by using the supervised neighborhoods.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Anchors: High-Precision Model-Agnostic Explanations.” Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Gregory Plumb, Denali Molitor, and Ameet S. Talwalkar. “Model Agnostic Supervised Local Explanations.” Advances in Neural Information Processing Systems, 2018.
A. Bloniarz, C. Wu, B. Yu, and A. Talwalkar. “Supervised Neighborhoods for Distributed Nonparametric Regression.” AISTATS, 2016.
S. Kazemitabar, A. Amini, A. Bloniarz, and A. Talwalkar. “Variable Importance Using Decision Trees.” NIPS, 2017.
Dua Dheeru and Efi Karra Taniskidou. UCI Machine Learning Repository, 2017.
DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.
Figure 1. Fine-tuning a language model to predict EEG data. The encoder is pretrained on Wikipedia to predict the next word in a sequence (or previous word for the backward LSTM). We use the contextualized embeddings from the encoder as input to a decoder. The decoder uses a convolution to create embeddings for each pair of words, which, along with word-frequency and word-length become the basis for a linear layer to predict EEG responses. The model is fine-tuned to predict this EEG data. In this example the model is jointly trained to predict the N400 and P600 EEG responses.
Imagine for a moment that we can take snippets of text, give them to a computational model, and that the model can perfectly predict some of the brain activity recorded from a person who was reading the same text. Can we learn anything about how the brain works from this model? If we trust the model, then we can at least identify which parts of the brain activity are related to the text. Beyond this though, what we learn from the model depends on how much we know about the mechanisms it uses to make its predictions. Given that we want to understand these mechanisms, and that models produced by deep learning can be difficult to interpret, deep learning seems at first glance not to be a good candidate for analyzing language processing in the brain. However, deep learning has proven to be amazingly effective at capturing statistical regularities in language (and other domains). This effectiveness motivated us to see whether a deep learning model is able to predict brain activity from text well, and importantly, whether we can gain any understanding about the brain activity from the predictions. It turns out that the answer to both questions is yes.
One of the open questions in the study of how the brain processes language is how word meanings are integrated together to form the meanings of sentences, passages and dialogues. Electroencephalography (EEG) is a tool that is commonly used to study those integrative processes. In a recent paper, we propose to use fine-tuning of a language model and multitask learning to better understand how various language-elicited EEG responses are related to each other. If we can better understand these EEG responses and what drives them, then we can use that understanding to better study language processing in people.
In our analysis, we use EEG observations of brain activity recorded as people read sentences. Several different kinds of deviations from baseline measurements of activity occur as people read text. The most well studied of these is called the N400 response. It is a negative deflection in the electrical activity (relative to a baseline) that occurs around 400 milliseconds after the onset of a word (hence “N400”), and it has been associated with semantic effort. If a word is expected in context — for example “I like peanut butter and jelly” versus “I like peanut butter and roller skates” — then the expected word “jelly” will elicit a reduced N400 response compared to the unexpected “roller skates”.
In the data we analyze (made available by Stefan Frank and colleagues) six different language associated responses are considered. Three of these — the N400, PNP, and EPNP responses — are generally considered markers for semantic processes in the brain while the other three — the P600, LAN, and ELAN — are generally considered markers for syntactic processes in the brain. The division of these EEG responses into indicators for syntactic and semantic processes is controversial, and there is considerable debate about what each of the responses signifies. The P600, for example, is thought by some researchers to be triggered by syntactic violations, as in “The plane took we to paradise and back” while others have noted that it can also be triggered by semantic role violations, as in “Every morning at breakfast the eggs would eat …”, and still others have questioned whether the P600 is language specific or rather a marker for any kind of rare event. One possibility is that it is associated with an attentive process invoked to reconcile conflicting information from lower level language processing. In any case, a clearer picture of the relationship between all of the EEG responses and between text and the EEG responses would make them better tools for investigating language processing in the brain.
Rather than having discrete labeling of whether each of the six EEG responses occurred as a participant read a given word, in this dataset the EEG responses are defined continuously as the average potential of a predefined set of EEG sensors during a predefined time-window (relative to when a word appears). This gives us six scalar values per word per experiment participant, and we average the values across the participants to give six final scalar values per word.
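A minimal sketch of how one such scalar could be computed, assuming a hypothetical data layout and sensor names (the dataset's actual preprocessing may differ):

```python
def word_response(epochs, sensors, window):
    # epochs: one dict per experiment participant, mapping each EEG sensor
    # name to its time series of potentials for a given word (layout is
    # illustrative). For each participant, average the potentials of the
    # predefined sensors over the predefined time window; then average
    # across participants to obtain one scalar per word, as described above.
    start, end = window
    per_participant = []
    for epoch in epochs:
        values = [v for s in sensors for v in epoch[s][start:end]]
        per_participant.append(sum(values) / len(values))
    return sum(per_participant) / len(per_participant)
```

Repeating this for each of the six predefined sensor sets and time windows yields the six scalar values per word.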
To predict these six scalar values for each word, we use a pretrained bidirectional LSTM as an encoder. We anticipate that the EEG responses occur in part as a function of a shift in the meaning or structure of the incoming language. For example, the N400 is associated with semantic effort and surprisal, so we might expect that the N400 would be some function of a difference between adjacent word embeddings. Because of this intuition, we pair up the embeddings output from the encoder by putting them through a convolutional layer that can learn functions on adjacent word embeddings. We use the pair embeddings output by the convolution, along with the word length and log-probability of the word as the basis for predicting the EEG responses. The EEG responses are predicted from this basis using a linear layer. The forward and backward LSTMs are pretrained independently on the WikiText-103 dataset to predict the next and previous words respectively from a snippet of text. We fine-tune the model by training the decoder first and keeping the encoder parameters fixed, and then after that we continue training by also modifying the final layer of the LSTM for a few epochs.
A natural question is whether these EEG measures of brain activity can be predicted from the text at all, and whether all of this deep learning machinery actually improves the prediction compared to a simpler model. As our measure of accuracy, we use the proportion of variance explained — i.e. we normalize the mean squared error on the validation set by the variance on the validation set and subtract that number from 1: \(\mathrm{POVE} = 1 - \frac{\mathrm{MSE}}{\mathrm{variance}}\). We compare the accuracy of using the decoder on top of three different encoders: an encoder which completely bypasses the LSTM (i.e. the output embeddings are the same as the input embeddings to the encoder), an encoder which is a forward-only LSTM, and an encoder which is the full bidirectional LSTM.
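The POVE metric itself is straightforward to compute; here is a minimal sketch:

```python
def pove(predictions, targets):
    # Proportion of variance explained: 1 - MSE / variance of the targets.
    # 0 corresponds to always guessing the mean of the targets; 1 is a
    # perfect prediction; negative values mean worse than guessing the mean.
    n = len(targets)
    mean_t = sum(targets) / n
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    variance = sum((t - mean_t) ** 2 for t in targets) / n
    return 1.0 - mse / variance
```

Guessing the per-word mean EEG response gives a POVE of exactly 0, which is why 0 is the chance level in the comparisons below.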
Surprisingly, we see that all six of the EEG measures can be predicted at above chance levels (\(0\) is chance here since guessing the mean would give us \(\mathrm{POVE}\) of \(0\)). Previous work (here and here) has found that only some of the EEG measures are predictable, but that work did not directly try to predict the brain activity from the text. Instead, it used an estimate of the surprisal (the negative log-probability of the word in context), and an estimate of the syntactic complexity to predict the EEG data. Those intermediate values have the benefit of being interpretable, but they lose a lot of the pertinent information.
We also see that the full bidirectional encoder is better able to predict the brain activity than the other encoders. The comparison between encoders is not completely fair because there are more parameters in the forward-only encoder than the embedding-only encoder, and more parameters than both of those in the bidirectional encoder, so part of the reason that the bidirectional encoder might be better is simply that it has more degrees of freedom to work with. Nonetheless, this result suggests that the context matters for the prediction of the EEG signals, which means that there is opportunity to learn about the features in the language stream that drive the EEG responses.
It’s good to see that the deep learning model can predict all of the EEG responses, but we also want to learn something about those responses. We use multitask learning to accomplish that here. We train our network using \(63 = \binom{6}{1} + \binom{6}{2} + \cdots + \binom{6}{6}\) variations of our loss function. In each variation, we choose a subset of the six EEG signals and include a mean squared error term for the prediction of each signal in that subset. For example, one of the variations includes just the N400 and the P600 responses, so there are mean squared error terms for the prediction of the N400 and the prediction of the P600 in the loss function for that variation, but not for the LAN. We only make predictions for content words (adjectives, adverbs, auxiliary verbs, nouns, pronouns, proper nouns, and verbs), so if there are \(B\) examples in a mini-batch, and example \(b\) has \(W_b\) content words, and if we let the superscripts \(p,a\) denote the predicted and actual values for an EEG signal respectively, then the loss function for the N400 and P600 variation can be written as:
$$\frac{1}{\sum_{b=1}^B W_b} \sum_{b=1}^B \sum_{w=1}^{W_b} \left[ (\mathrm{P600}^{p}_{b,w} - \mathrm{P600}^{a}_{b,w})^2 + (\mathrm{N400}^{p}_{b,w} - \mathrm{N400}^{a}_{b,w})^2 \right]$$
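The subset construction and the loss above can be sketched as follows (the signal names match the text; the data layout is illustrative):

```python
from itertools import combinations

SIGNALS = ["N400", "P600", "PNP", "EPNP", "LAN", "ELAN"]

def loss_variations(signals):
    # All non-empty subsets of the six EEG signals: 2^6 - 1 = 63 variations
    # of the loss function, one per subset.
    return [subset for r in range(1, len(signals) + 1)
            for subset in combinations(signals, r)]

def multitask_mse(pred, actual, subset):
    # pred / actual: {signal: [value per content word across the mini-batch]}.
    # Sum squared errors over the signals in the chosen subset, normalized
    # by the total number of content words, matching the loss in the text.
    n_words = len(next(iter(pred.values())))
    total = 0.0
    for signal in subset:
        total += sum((p - a) ** 2
                     for p, a in zip(pred[signal], actual[signal]))
    return total / n_words
```

For the N400/P600 variation, `multitask_mse(pred, actual, ("N400", "P600"))` reproduces the displayed loss.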
The premise of this method is that if two or more EEG signals are related to each other, then including all of the related signals as prediction tasks should create a helpful inductive bias. With this bias, the function that the deep learning model learns between the text and an EEG signal of interest should be a better approximation of the true function, and therefore it should generalize better to unseen examples.
We filter the results to keep (i) the variations that include just a single EEG response in the loss function (the top bar in each group below), (ii) the variations that best explain each EEG response (the bottom bar in each group below), and (iii) the variations which are not significantly different from the best variations and which include no more EEG responses in the loss function than the best variation, i.e. all simpler combinations of EEG responses which perform as well as the best combination (all the other bars). For the N400, where the best variation does not include any other EEG signals, we also show how the proportion of variance explained changes when we include each of the other EEG signals.
For each target EEG signal other than the N400, it is possible to improve prediction by using multitask learning. As Rich Caruana points out in his work on multitask learning, a target task can be improved by auxiliary tasks even when the tasks are unrelated. However, our results are suggestive of relationships between the EEG signals. It’s not the case that training with more EEG signals is always better, and the pattern of improvements for different variations doesn’t look random. The improvements also don’t follow the pattern of raw correlations between the EEG signals (see our paper for the correlations).
Some of the relationships we see here are expected from current theories of how each EEG response relates to language processing. The LAN/P600 and ELAN/P600 relationship is expected based both on prior studies where they have been observed together and theory that the ELAN/LAN responses occur during syntactic violations and the P600 occurs during increased syntactic effort. Our results also suggest some relationships which are not as expected, but which have plausible explanations. For example, some researchers believe that the ELAN and LAN responses mark working memory demands, and if this is so, then those responses might be expected to be related to the other signals that track language processing demands of any kind. That could explain why they seem to widely benefit (and benefit from) the prediction of other signals. However, the apparent isolation of the N400 from this benefit would be surprising in that case.
We need to be a little careful about over-interpreting the results here; the way the EEG responses are defined in this dataset means that several of them are spatially overlapping and temporally close to each other, so some signals may spill over into others. Future studies will be required to tease apart the possibilities suggested by this analysis, but we believe that this methodology is a promising direction. Multitask learning can help us understand complex relationships between EEG signals. We can also partially address the concern about signal spill-over by including other prediction tasks.
Two additional tasks we can include are prediction of self-paced reading times (in which words are shown one-by-one and the experiment participant presses a button to advance to the next word) and eye-tracking data. Both are available from different experiment participants for the sentences that the EEG signals were collected on. Self-paced reading times and eye-tracking data can both be thought of as measures of reading comprehension difficulty, so we expect that they should be related to the EEG data. Indeed, we see that when these tasks are used in training, both benefit the prediction of the EEG data compared to training on the target EEG signal alone. This result is really interesting because it cannot be explained by any spill-over effect. It suggests that the model might really be learning about some of the latent factors that underlie both EEG responses and behavior (for the detailed results and further discussion of the behavioral data, please see our paper).
It’s really exciting to see how well the EEG signals can be predicted using one of the latest language models, and multitask learning gives us some insight into how the EEG signals relate to each other and to behavioral data. While this analysis method is for now largely exploratory and suggestive, we hope to extend it over time to gain more and more understanding of how the brain processes language. If you’re interested in more information about the method or further discussion of the results, please check out our paper here.
Over the past decade, artificial intelligence (AI) has achieved remarkable success in many fields such as healthcare, automotive, and marketing. The capabilities of sophisticated, autonomous decision systems driven by AI keep evolving and moving from lab to reality. Many of these systems are black-boxes, which means we don’t really understand how they work or why they reach the decisions they do.
As black-box decision systems have come into greater use, they have also come under greater criticism. One of the main concerns is that it is dangerous to rely on black-box decisions without knowing the way they are made. Here is an example of why they can be dangerous.
Risk-assessment tools have been widely used in the federal and state courts to facilitate and improve judges’ decisions in the criminal justice process. They estimate defendants’ future criminal risk based on socio-economic status, family background, and other factors. In May 2016, ProPublica claimed that one of the most widely used risk-assessment tools, COMPAS, was biased against black defendants while being more generous to white defendants [link]. Northpointe, the for-profit company that provides the software, disputed the analysis but refused to disclose the software’s decision mechanism. So it is not possible for either stakeholders or the public to see what might actually be creating the disparity.
Here, we raise a question: how can we go about resolving this concern? Explaining how a black-box decision system works, or why it reaches a particular decision, helps us decide whether or not to follow its decisions. The need for interpretability is especially urgent in fields where black-box decisions can be life-changing and have significant consequences, such as disease diagnosis, criminal justice, and self-driving cars.
What makes a ‘good’ explanation for a black-box? Assume that you give a black-box predictive model an image of an apple. You open the black-box and explain why it believes the image indeed shows an apple. Simply saying “it is red, so this is an apple” is not sufficient to justify the decision, but you should also avoid redundant explanation. It is important to give enough information concisely when explaining a black-box decision system. In other words, explanations should be brief but comprehensive.
How can we take into account both briefness and comprehensiveness for explaining a black-box? Our work uses an information theoretic perspective to quantify the idea of briefness and comprehensiveness.
The information bottleneck principle (Tishby et al., 2000) provides an appealing information theoretic view of learning supervised models by defining what we mean by a ‘good’ representation. The principle says that the optimal model transmits as much information as possible from its input to its output through a compressed representation called the information bottleneck. The information bottleneck is thus a representation that is maximally informative about the output while maximally compressed with respect to a given input. Recently, Tishby et al. (2015) and Shwartz-Ziv et al. (2017) showed that the principle also applies to deep neural networks, where each layer of the network can work as an information bottleneck.
We adopt the information bottleneck principle as a criterion for finding a ‘good’ explanation. In the information theoretic view, we define a brief but comprehensive explanation as maximally informative about the black-box decision while compressive about a given input. In other words, the explanation should maximally compress the mutual information regarding an input while preserving as much as possible mutual information regarding its output.
We introduce the variational information bottleneck for interpretation (VIBI), a system-agnostic information bottleneck model that provides a brief but comprehensive explanation for every single decision made by a black-box.
VIBI is composed of two parts: an explainer and an approximator, each of which is modeled by a deep neural network. Using the information bottleneck principle, VIBI learns an explainer that favors brief explanations while enforcing that the explanations alone suffice for an accurate approximation of the black-box. See the following figure for an illustration of VIBI.
For each instance, the explainer returns the probability that each chunk of features, called a cognitive chunk, will be selected as part of the explanation. A cognitive chunk is defined as a group of raw features that works as a unit to be explained and whose identity is recognizable to a human, such as a word, phrase, sentence, or group of pixels. The selected chunks act as an information bottleneck that is maximally compressed about the input and maximally informative about the decision made by the black-box system on that input.
Now, we formulate the following optimization problem inspired by the information bottleneck principle to learn the explainer and approximator:
$$ p(\mathbf{z} | \mathbf{x}) = \mathrm{argmax}_{p(\mathbf{z} | \mathbf{x}), p(\mathbf{y} | \mathbf{t})} ~~\mathrm{I} ( \mathbf{t}, \mathbf{y} ) - \beta~\mathrm{I} ( \mathbf{x}, \mathbf{t} )$$ where \( \mathrm{I} ( \mathbf{t}, \mathbf{y} ) \) represents the sufficiency of the information retained for explaining the black-box output \( \mathbf{y} \), \(-\mathrm{I} ( \mathbf{x}, \mathbf{t} ) \) represents the briefness of the explanation \( \mathbf{t} \), and \( \beta \) is a Lagrange multiplier controlling the trade-off between the two.
The information bottleneck objective in this form is intractable due to the mutual information terms and the non-differentiable sampling of \( \mathbf{z} \). We address these challenges as follows.
Variational Approximation to Information Bottleneck Objective
The mutual information terms \( \mathrm{I} ( \mathbf{t}, \mathbf{y} ) \) and \( \mathrm{I} ( \mathbf{x}, \mathbf{t} ) \) are computationally expensive to quantify (Tishby et al., 2000; Chechik et al., 2005). To reduce the computational burden, we use a variational lower bound on our information bottleneck objective: $$\mathrm{I} ( \mathbf{t}, \mathbf{y} )~-~\beta~\mathrm{I} ( \mathbf{x}, \mathbf{t} )
\geq \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} \mathbb{E}_{\mathbf{y} | \mathbf{x} \sim p(\mathbf{y} | \mathbf{x})} \mathbb{E}_{\mathbf{t} | \mathbf{x} \sim p(\mathbf{t} | \mathbf{x})} \left[ \log q(\mathbf{y} | \mathbf{t}) \right] ~-~\beta~\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})} \mathrm{KL} \left( p(\mathbf{z}| \mathbf{x}) \,\|\, r(\mathbf{z}) \right) $$
Now, we can integrate the Kullback-Leibler divergence \( \mathrm{KL} ( p(\mathbf{z}| \mathbf{x}) \,\|\, r(\mathbf{z}) ) \) analytically with proper choices of \( r(\mathbf{z}) \) and \( p(\mathbf{z}|\mathbf{x}) \). We also use the empirical data distribution to approximate \( p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y}|\mathbf{x}) \).
Continuous Relaxation and Re-parameterization
We use the generalized Gumbel-softmax trick (Jang et al., 2017; Chen et al., 2018), which approximates the non-differentiable categorical subset sampling with Gumbel-softmax samples that are differentiable. This trick allows using standard backpropagation to compute the gradients of the parameters via reparameterization.
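To give a flavor of the trick, here is a minimal sketch of the basic single-sample categorical case (the generalized version cited above extends this to sampling subsets of \( k \) chunks):

```python
import math
import random

def gumbel_softmax(logits, temperature=0.5):
    # Perturb each logit with Gumbel(0, 1) noise g = -log(-log(u)),
    # u ~ Uniform(0, 1), then take a temperature-scaled softmax. As the
    # temperature approaches 0 the output approaches a one-hot categorical
    # sample, yet the map remains differentiable in the logits throughout,
    # which is what allows standard backpropagation.
    perturbed = []
    for logit in logits:
        u = min(max(random.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        perturbed.append((logit - math.log(-math.log(u))) / temperature)
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the noise enters additively and the softmax is smooth, gradients with respect to the logits flow through the sample (the reparameterization described in the text).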
VIBI provides instance-specific keywords to explain an LSTM sentiment prediction model trained on the Large Movie Review Dataset (IMDB).
Keywords such as “waste” and “horrible” are selected for a movie review predicted to be negative, while keywords such as “most fascinating” explain a review predicted to be positive. We can also see that the LSTM sentiment prediction model makes a wrong prediction on one negative review because the review includes several positive words such as “enjoyable” and “exciting”.
VIBI also provides instance-specific key patches containing \( 4 \times 4 \) pixels to explain a CNN digit recognition model using the MNIST image dataset.
The first two examples show that the CNN recognizes digits using both shapes and angles. In the first example, the CNN characterizes ‘1’s by patches aligned in a straight line along the activated regions, even though the ‘1’s in the left and right panels are written at different angles. In contrast, the second example shows that the CNN recognizes the difference between ‘9’ and ‘6’ by their differences in angles. The last two examples show that the CNN distinguishes ‘7’s from ‘1’s by patches located on the activated horizontal line of the ‘7’ (see the cyan circle) and recognizes ‘8’s by two patches on the top of the digit and another two patches at the bottom circle.
We assume that a better explanation allows humans to better infer the black-box output given the explanation. Therefore, we asked humans to infer the output of the black-box system (Positive/Negative/Neutral) given five keywords as an explanation generated by VIBI and other competing methods (Saliency, LIME, and L2X). Each method was evaluated by workers on Amazon Mechanical Turk who hold the Masters Qualification (i.e., high-performance workers who have demonstrated excellence across a wide range of tasks). We also evaluated interpretability for the CNN digit recognition model using MNIST. We asked humans to directly score the explanations on a 0 to 5 scale (0 for no explanation, 1-4 for an insufficient or redundant explanation, and 5 for a concise explanation). Each method was evaluated by 16 graduate students at the School of Computer Science, Carnegie Mellon University, who have taken at least one graduate-level machine learning class.
We assessed the fidelity of the approximator by its prediction performance with respect to the black-box output. We introduce two formalized metrics to quantitatively evaluate fidelity: approximator fidelity and rationale fidelity.
Approximator fidelity captures the ability of the approximator to imitate the behaviour of the black-box. As shown above, VIBI and L2X outperform the others in approximating the black-box models. However, this does not mean the two approximators have the same fidelity. See below.
Rationale fidelity captures how much the selected chunks contribute to the approximator fidelity. As shown above, the selected chunks of VIBI account for more of the approximator fidelity than those of L2X. Note that L2X is a special case of VIBI with the information bottleneck trade-off parameter \( \beta = 0 \) (i.e., without the compressiveness constraint \( -\mathrm{I} ( \mathbf{x}, \mathbf{t} ) \)). Therefore, compressing information through the explainer achieves not only conciseness of the explanation but also better fidelity of the explanation to the black-box.
Note that the number of cognitive chunks to be selected, \( k \), must be given in advance. It also impacts the conciseness of the total explanation and should be chosen carefully. In our analysis, we choose \( k \) as the minimum number for which the fidelity exceeds a given threshold.
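That selection rule can be sketched as follows (a hypothetical helper; the fidelity values would come from evaluating the trained approximator at each candidate \( k \)):

```python
def choose_k(fidelity_by_k, threshold):
    # fidelity_by_k: {k: approximator fidelity achieved with k cognitive
    # chunks}. Return the smallest k whose fidelity meets the threshold,
    # matching the selection rule described in the text; fall back to the
    # largest candidate if no k is good enough.
    for k in sorted(fidelity_by_k):
        if fidelity_by_k[k] >= threshold:
            return k
    return max(fidelity_by_k)
```

For example, with fidelities {5: 0.7, 10: 0.9, 15: 0.95} and a threshold of 0.85, the rule picks k = 10.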
Further details can be found here. The code is publicly available here.
Consider the following problem: we are given a set of items, and the goal is to pick the “best” ones from them. This problem appears very often in real life — for example, selecting papers in conference peer review, judging the winners of a diving competition, picking city construction proposals to allocate funds, etc. In these examples, a common procedure is to assign the items (papers/contestants/proposals) to people (reviewers/judges/citizens) and ask them for their opinion. Then, we aggregate their opinions, and select the best items accordingly. For simplicity, we assume that each item has a “true” quality (or “true” value), which is a real number that precisely quantifies how good the item is, and this number is unknown to us. The best items are then the ones with the highest true qualities.
There are a number of sources of biases that may arise when soliciting evaluations from people. In this blog post, we focus on miscalibration, which refers to people using different scales when assigning numerical scores. As a running example throughout this blog post, we consider conference peer review. Peer review is a process common in scientific publication. When researchers submit a paper to a conference, the conference organizers will assign the paper to a few “peer reviewers”, who are researchers in the same field, and ask these reviewers to evaluate the quality of the paper. Based on the reviews and comments written by the peer reviewers, the conference organizers make a decision on whether to accept or reject the paper.
It might be the case that some reviewers are lenient and always provide scores in the range [6, 10] whereas some reviewers are more stringent and provide scores in the range [0, 4]. Or it might be the case that one reviewer is moderate whereas the other is extreme — the first reviewer’s 2 is equivalent to the second reviewer’s 1 whereas the first reviewer’s 3 is equivalent to the second reviewer’s 9. Indeed, the issue of miscalibration has been widely noted in the literature:
“The rating scale as well as the individual ratings are often arbitrary and may not be consistent from one user to another.”
[Ammar & Shah, 2012]
“A raw rating of 7 out of 10 in the absence of any other information is potentially useless.”
[Mitliagkas et al. 2011]
So what should we do with the miscalibrated scores we receive? There are two common approaches to address miscalibration. One approach is to make simple assumptions about the nature of miscalibration. For example, in the past, people have assumed that miscalibration is linear. That is, when a reviewer reviews a paper, the score reported by this reviewer will be the true quality of the paper multiplied by a positive scalar, followed by an addition or subtraction of another scalar. However, calibration issues with human-provided scores are often much more complex, and therefore we have not seen much success with these simple models in real conference peer review settings.
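To make the contrast concrete, here is a small illustration with made-up calibration functions: a linear calibration as in the simple models above, and an arbitrary strictly increasing one. Both preserve the ranking of the papers, but the monotone one distorts score gaps arbitrarily, which is part of why simple linear models fall short.

```python
def linear_report(x, a=2.0, b=1.0):
    # Linear miscalibration from the simple model above: multiply the true
    # quality by a positive scalar, then shift (a, b are illustrative values).
    return a * x + b

def monotone_report(x):
    # An arbitrary strictly increasing miscalibration: the ordering of
    # papers is preserved, but the gaps between reported scores are not.
    return x ** 3

qualities = [0.5, 1.0, 3.0]
linear_scores = [linear_report(x) for x in qualities]      # [2.0, 3.0, 7.0]
monotone_scores = [monotone_report(x) for x in qualities]  # [0.125, 1.0, 27.0]
```

Under the linear model the gaps between scores still carry information about quality gaps; under the cubic calibration the second gap is blown up far out of proportion, so only the ordering survives.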
The second approach is to use only the ranking of the items. We use “ranking” to refer to the ordering of items. For example, if a reviewer gives scores of 5, 9, and 3 to three papers respectively, then the “ranking” from this reviewer is that the second paper is better than the first paper, and the first paper is better than the third. The ranking can be obtained by sorting the reviewer’s scores of papers (for simplicity, assume there are no ties), or by directly asking the reviewers to rank the papers. In practice, rankings are often used instead of numerical scores. For example, quoting the landmark paper by Freund et al.:
“[Using rankings instead of ratings] becomes very important when we combine the rankings of many viewers who often use completely different ranges of scores to express identical preferences.”
[Freund et al. 2003]
It is a folklore belief that without making any simplifying assumptions on miscalibration, the only useful information is the underlying ranking. In our AAMAS 2019 paper, we examine the fundamental question of whether this folklore belief is true. Concretely, we present theoretical results that contest this belief. We show that, if we use the rating data instead of only the ranking data, we can do strictly better in selecting the best items, even amidst high levels of miscalibration.
For simplicity, let’s first consider the following toy problem: given two papers and two reviewers, we want to select the better paper out of the two. Suppose each reviewer is assigned one paper, and this assignment is done uniformly at random. The two reviewers provide their evaluations (e.g., on a scale from 0 to 10) for the respective paper they review. The reviewers’ rating scales may be miscalibrated. This miscalibration can be arbitrary, and is unknown to us. Since each reviewer only provides a single score, the ranking data collected from each reviewer is vacuous. An algorithm based on rankings can’t really do better than randomly guessing which paper is better. The question we aim to answer is thus: in such a case, can we do strictly better than a random guess, by making use of the scores given by the two reviewers, instead of just their rankings?
Interestingly, as we will explain shortly, the answer turns out to be “yes”. This contests the folklore belief that under arbitrary miscalibration, the only useful information in ratings is the underlying ranking.
To understand the general problem of miscalibration, we first consider a simplified setting, and the key ideas from this setting will be used as a crucial building block for more general algorithms. In this simplified setting, assume that we have two papers with unknown quality values \(x_1, x_2\in \mathbb{R}\), and two reviewers. The two papers are respectively assigned to the two reviewers uniformly at random. That is, paper 1 is assigned to reviewer 1 and paper 2 to reviewer 2 with probability 0.5 (otherwise, paper 1 is assigned to reviewer 2 and paper 2 to reviewer 1). For each reviewer \(i \in \{1, 2\}\), we use a “calibration function” \(f_i: \mathbb{R} \rightarrow \mathbb{R}\) to represent the miscalibration of that reviewer. This function is a mapping from the true quality of a paper to the score that the reviewer will report for this paper. That is, if the true value of a paper evaluated by reviewer \(i\) is \(x\), then the reviewer will report \(f_i(x)\). For convenience of exposition, we normalize the rating scale such that the ratings lie in the range [0, 1], so we have \(f_i : \mathbb{R} \rightarrow [0, 1]\).
We assume that the calibration functions \(f_1\) and \(f_2\) are strictly monotonically increasing. That is, if a reviewer were assigned a paper of higher quality, the reviewer would give it a higher score (though we don’t know by how much) than a paper of lower quality. Other than that, the values \(x_1, x_2\) and functions \(f_1, f_2\) can be arbitrary. Let us denote the reported score for paper 1 (from its assigned reviewer) as \(y_1\), and the reported score for paper 2 (from its assigned reviewer) as \(y_2\). Given the scores \(y_1, y_2\), and the assignment of which paper is assigned to which reviewer, our goal is to tell which paper is better (i.e., infer whether \(x_1 > x_2\) or \(x_1 < x_2\)).
At first, it may seem impossible to extract any useful information from the numerical scores, as the two papers are reviewed by different reviewers, and therefore the scores can be different due to either miscalibration, or differences in the true paper qualities. Say, reviewer 1 is assigned paper 1, and gives a score of 0.5, and reviewer 2 is assigned paper 2 and gives a score of 0.8. Then either of the following two cases is possible (among an infinite number of possible cases):
$$\begin{aligned}
\text{Case I:}\quad & x_1 = 0.5, \qquad f_1(x) = x\\
& x_2 = 0.8, \qquad f_2(x) = x\\[4pt]
\text{Case II:}\quad & x_1 = 1.0, \qquad f_1(x) = \tfrac{x}{2}\\
& x_2 = 0.8, \qquad f_2(x) = x.
\end{aligned}$$
In Case I, we have \(x_1 < x_2\), and in Case II, we have \(x_1 > x_2\). If an algorithm outputs the outcome aligned with one case, then the algorithm will fail in the other case. Indeed, the following theorem shows that no deterministic algorithm based on ratings can ever be strictly better than random guessing.
Theorem 1. Given the scores \(y_1, y_2\) and the assignment, no deterministic algorithm can always perform strictly better than random guessing, under all possible \(x_1, x_2\) and strictly monotonic \(f_1, f_2\).
Let’s try to understand why this is the case. A deterministic algorithm “commits” to an action (deciding which paper has a better quality). It performs well if the situation is aligned with this action. However, due to its prior commitment it may fail if the situation is not aligned. To be more specific, consider the game of rock-paper-scissors. In this game, a deterministic algorithm always loses to an adversary (if the deterministic algorithm plays scissors, then the adversary wins by playing rock, etc.).
The key to solving the problem is randomization: a randomized algorithm can judiciously balance out the good and bad cases. Going back to the example of rock-paper-scissors, consider a randomized algorithm that chooses one of the three actions (rock, paper or scissors) uniformly at random; it can be formally shown that this randomized algorithm wins 1/3 of the time against any adversary. Given this motivation, we consider the following randomized algorithm for our estimation problem:
Algorithm. Output the paper with the higher score, with probability \(\frac{1 + \lvert y_1-y_2\rvert}{2}\). Otherwise, output the paper with the lower score.
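The algorithm is a one-liner to implement. Here is a minimal sketch (the function name is ours, not from the paper):

```python
import random

def pick_better_paper(y1, y2):
    """Randomized comparison of the two reported scores y1, y2 in [0, 1].

    Returns 1 or 2: the index of the paper guessed to be better.
    Outputs the higher-scored paper with probability (1 + |y1 - y2|) / 2.
    """
    p_higher = (1 + abs(y1 - y2)) / 2
    higher, lower = (1, 2) if y1 >= y2 else (2, 1)
    return higher if random.random() < p_higher else lower
```

When the scores are far apart, the algorithm almost always trusts them; when they are close, it behaves nearly like a coin flip.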
We now show that our randomized algorithm can indeed achieve the desired goal.
Theorem 2. The proposed randomized algorithm succeeds with probability strictly greater than \(0.5\), for any \(x_1, x_2\) and strictly monotonic \(f_1, f_2\).
More generally, let \(g:\mathbb{R} \rightarrow [-1, 1]\) be any strictly monotonically increasing function that is anti-symmetric around 0 (that is, \(g(x) = -g(-x)\) for all \(x\in \mathbb{R}\)). Then, Theorem 2 holds for algorithms that output the paper with the higher score with probability \(\frac{1 + g(\lvert y_1 - y_2\rvert)}{2}\). The algorithm mentioned above is a special case using the identity function \(g(u) = u\). The function \(g\) can also take other forms, such as the scaled sigmoid \(g(u) = \frac{2}{1 + e^{-u}} - 1\) (equivalently, \(\tanh(u/2)\)), which satisfies the anti-symmetry requirement \(g(0) = 0\).
Using the canonical \(2\times 2\) setting as a building block, we construct algorithms in more general settings such as A/B testing and ranking. See our paper for more details. The paper also includes a discussion on the inspirations and connections to the related work, including Stein’s shrinkage, empirical Bayes, and the two-envelope problem [Cover 1987].
The rest of this section is devoted to giving intuition about Theorem 2 along with a proof sketch.
The key intuition of this result is to exploit the monotonic structure of the calibration functions, whereas this structure is unavailable in ranking data. As we discussed, the randomized algorithm does not make a prior commitment, but instead spreads out its bets on both the good and the bad cases. In this case, because of the monotonic structure of the calibration functions, the probability of the good case (correct estimation) is greater than the probability of the bad case (incorrect estimation) for the randomized algorithm. We now provide a simple proof sketch.
Proof sketch. Without loss of generality, assume that \(x_1 < x_2\). We consider two cases:
Case I: The scores given by the two reviewers for paper 2 are strictly higher than the scores for paper 1. That is, \(\max\{f_1(x_1), f_2(x_1)\} < \min\{f_1(x_2), f_2(x_2)\}\).
With the random assignment, we observe either \(\{f_1(x_1), f_2(x_2)\}\) or \(\{f_2(x_1), f_1(x_2)\}\). In either assignment, we have \(y_2 > y_1\), and the proposed algorithm succeeds with probability \(\frac{1 + (y_2 - y_1)}{2} > \frac{1}{2}\).
Case II: In at least one of the assignments, the score for paper 2 is lower than or equal to the score for paper 1. Without loss of generality, assume \(f_1(x_1) \ge f_2(x_2)\). Then by the monotonicity of \(f_1, f_2\), we have $$ f_2(x_1) < f_2(x_2) \le f_1(x_1) < f_1(x_2). \qquad \qquad (\star)$$
We illustrate Equation \((\star)\) pictorially as follows:
With the assignment, we either observe the two blue scores, or the two red scores. In the blue assignment, the algorithm is more likely to conclude that paper 1 is better (the bad case). In the red assignment, the algorithm is more likely to conclude that paper 2 is better. The difference \(\lvert y_1 - y_2 \rvert\) between the two scores is greater in the red assignment. By construction, the algorithm leverages this difference, so that it “succeeds more” in the red assignment than it “loses” in the blue assignment.
More formally, for the assignment \(\{f_2(x_1), f_1(x_2)\}\), the algorithm succeeds with probability \(\frac{1 + (f_1(x_2) - f_2(x_1))}{2}\), and for the assignment \(\{f_1(x_1), f_2(x_2)\}\), the algorithm succeeds with probability \(\frac{1 - (f_1(x_1) - f_2(x_2))}{2}\). Taking an expectation over the two equally likely assignments, the overall probability of success is
$$ \frac{1}{2} + \frac{(f_1(x_2) - f_2(x_1)) - (f_1(x_1) - f_2(x_2))}{4} > \frac{1}{2}, $$
because \( f_1(x_2) - f_2(x_1) > f_1(x_1) - f_2(x_2)\) by Equation \((\star)\) (or by the Figure).
\(\square\)
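To make the proof sketch concrete, we can compute the algorithm’s exact success probability for the Case II example from earlier (\(x_1 = 1.0\), \(x_2 = 0.8\), \(f_1(x) = x/2\), \(f_2(x) = x\)), averaging over the two equally likely assignments. The script below is our own illustration, not code from the paper:

```python
def success_probability(x1, x2, f1, f2):
    """Exact success probability of the randomized algorithm for two papers
    with true qualities x1, x2 and reviewer calibration functions f1, f2."""
    def p_pick_paper2(y1, y2):
        # Probability the algorithm outputs paper 2 given reported scores.
        p_higher = (1 + abs(y1 - y2)) / 2
        if y2 > y1:
            return p_higher
        if y2 < y1:
            return 1 - p_higher
        return 0.5
    # Two assignments, each occurring with probability 1/2.
    p2 = 0.5 * p_pick_paper2(f1(x1), f2(x2)) + 0.5 * p_pick_paper2(f2(x1), f1(x2))
    return p2 if x2 > x1 else 1 - p2

# Case II from the text: paper 1 is truly better, yet reviewer 1 halves scores.
p = success_probability(1.0, 0.8, lambda x: x / 2, lambda x: x)
print(p)  # 0.575 > 0.5
```

Despite the arbitrary (and unknown) miscalibration, the success probability stays strictly above one half, exactly as Theorem 2 promises.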
The two key take-aways from our paper are:
(1) Numerical scores contain strictly more information than rankings, even in the presence of arbitrary miscalibration. This is in contrast to the folklore belief that under arbitrary miscalibration, the only useful information in ratings is the underlying ranking.
(2) In conference peer review, paper decisions are typically made in a deterministic fashion. However, for papers near the acceptance border, the difference in their scores is small, and could very well be due to reviewer miscalibration rather than the inherent quality of the papers. Our work thus suggests that a fairer alternative is to randomize the decisions for borderline papers, in the spirit of our proposed algorithm, in order to account for miscalibration.
Our paper also gives rise to a number of open problems of interest:
(1) Non-adversarial models: In order to analyze the folklore belief, we consider arbitrary miscalibration, and give an algorithm based on ratings that uniformly outperforms algorithms based on rankings. From a practical point of view, it is of interest to model the nature of miscalibration that is not the worst case — something in between over-simplified models for miscalibration and arbitrary miscalibration.
(2) Combining different sources of biases: Miscalibration does not happen in isolation, and indeed other factors do contribute to inaccuracies in terms of paper decisions, such as subjectivity [Noothigattu et al. 2018], strategic behavior [Xu et al. 2018] and noise [Stelmakh et al. 2018]. For example, subjectivity means that people may hold different opinions about the merits of certain papers — what one reviewer thinks is a good paper may look like a mediocre paper from the perspective of another reviewer, and therefore the paper receives different scores from the two reviewers (whereas miscalibration means that even if a paper appears to be identically good to two reviewers, the reviewers may still give different scores due to miscalibration). Combining miscalibration simultaneously with these other factors is a useful and challenging open problem.
DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.
A. Ammar and D. Shah. “Efficient rank aggregation using partial data”. SIGMETRICS 2012.
T. Cover. “Pick the Largest Number”. 1987.
Y. Freund, R. Iyer, R. E. Schapire and Y. Singer. “An Efficient Boosting Algorithm for Combining Preferences”. Journal of Machine Learning Research 2003.
I. Mitliagkas, A. Gopalan, C. Caramanis and S. Vishwanath. “User rankings from comparisons: Learning permutations in high dimensions”. Allerton 2011.
R. Noothigattu, N. Shah and A. Procaccia. “Choosing how to choose papers”. arXiv 2018.
I. Stelmakh, N. Shah and A. Singh. “PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review”. ALT 2019.
Y. Xu, H. Zhao, X. Shi and N. Shah. “On strategyproof conference review”. arXiv 2018.
Why did a Deep Neural Network (DNN) make a certain prediction? Although DNNs have been shown to be extremely accurate predictors in a range of domains, they are still largely black-box functions—even to the experts who train them—due to their complicated structure with compositions of multiple layers of nonlinearities. The most popular approach used to shed light on the predictions of DNNs is to create what is known as a saliency map, which provides a relevance score for each feature. While saliency maps may provide insights on what features are important to a DNN, it remains unclear if or how to use this information to improve a given model. One potential solution is to show not only the set of features important to a DNN for some specific prediction, but also the most relevant set of training examples, i.e., prototypes. As we will show, these not only help us understand the predictions of a given DNN, but also provide insights into how to improve the performance of the model.
In our recent paper at NeurIPS 2018, we explain the prediction of a DNN by splitting the output into a sum of contributions from each of the training instances. Before getting into the formal details, here is an illustration of how our approach works for a DNN f and a similarity function K with some precalculated sample importance.
In the figure above, we consider an image classifier trained to determine whether an image is a dog or not, and the classifier is given an image of a dog (left) and a cat (right) at test time. To understand why the model predicted the first image as a dog and the second image as not a dog, we decompose the prediction score for the dog class (0.7 and 0 respectively) into a sum of the weighted similarities between the test image and each training image. This sheds light on which training images are most important for the prediction: the blue box highlights an example with high positive influence (which we call positive prototypes), and the red box highlights one with high negative influence (negative prototypes).
The idea of decomposing a predictor into a linear combination of functions of training points is not new (for interested readers, we refer you to representer theorems when the predictor lies in certain well-behaved spaces of functions). In the following theorem, we provide an analogous decomposition for deep neural networks.
Representer Theorem for Neural Networks: Let us denote the neural network prediction for a test input \(x_t\) by \(\hat{y}_t = \sigma(\Phi(x_t, \Theta))\), where \(\Phi(x_t, \Theta) = \Theta_1 f_t\) and \(f_t = \Phi_2(x_t,\Theta_2)\). In simple words, \(\sigma\) is the activation function over the output logit \(\Phi\), and \(\Theta_1\) is the weight of the last layer, which takes \(f_t\) as input. Suppose \(\Theta^*\) is a stationary point of the optimization problem: \begin{equation} \arg\min_{\Theta} \left\{\frac{1}{n}\sum_i^n L(x_i,y_i,\Theta) + g(||\Theta_1||)\right\},\end{equation} where \(g(||\Theta_1||) = \lambda ||\Theta_1||^2\) for some \(\lambda >0\). Then we have the decomposition: \begin{equation}\Phi(x_t,\Theta^*) = \sum_i^n \alpha_i k(x_t, x_i),\end{equation} where \(\alpha_{i} = \frac{1}{-2 \lambda n} \frac{\partial L(x_i,y_i,\Theta)}{\partial \Phi(x_i,\Theta)} \), \(k(x_t,x_i) = f_{i}^T f_t\), and \(\Theta^*_1 = \sum_i^n \alpha_i f_{i} \). We call each term in the summation a representer value for \(x_i\) given \(x_t\), and we call each \(x_i\) associated with the representer value \(\alpha_i k(x_t, x_i)\) a representer point. We note that \(\alpha_{i}\) measures the importance of the training instance \(x_i\) on the learned parameters, and thus we call \(\alpha_{i}\) the global sample importance since it is independent of the test instance.
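For intuition, the decomposition is easy to verify numerically in the simplest instance of the theorem: a linear model with squared loss, an \(\ell_2\) penalty on the (only) layer, and the identity feature map. This toy check is our own, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))   # training features f_i (identity feature map)
y = rng.normal(size=n)        # training targets
# Stationary point of (1/n) sum_i (theta . x_i - y_i)^2 + lam * ||theta||^2:
theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
# Global sample importance: alpha_i = (1 / (-2*lam*n)) * dL/dPhi,
# where dL/dPhi = 2 * (Phi_i - y_i) for the squared loss.
alpha = -(X @ theta - y) / (lam * n)
# Check Theta_1 = sum_i alpha_i f_i: the weights are spanned by the features.
assert np.allclose(theta, X.T @ alpha)
# Check the test prediction decomposes into representer values alpha_i k(x_t, x_i).
x_t = rng.normal(size=d)
assert np.isclose(theta @ x_t, np.sum(alpha * (X @ x_t)))
```

The same identities hold for a deep network once \(f_i\) is taken to be the last-layer feature embedding and the stationary point is found with respect to the full loss.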
Our theorem indicates that the predictions of a deep neural network can be decomposed according to the figure below.
Intuition for the Representer Theorem and examples of prototypes: For the representer value \(\alpha_i k(x_t, x_i)\) to be positive, the global sample importance and the feature similarity must have the same sign. For a particular test image, this means that the test image and the training image look similar to each other, and (likely) have the same classification label. Similarly, for this value to be negative, the global sample importance and the feature similarity must have different signs, e.g., one is negative and the other positive. For a particular test image, this means that the images may look similar to each other, but they have different classification labels. Because we have decomposed the activation of the neural network into a sum of these representer values, we say that positive prototypes excite the network, and negative prototypes inhibit it, towards predicting a particular class.
As shown in the above figure, the positive representer points are all from the same class as the test point, and have a similar appearance. On the other hand, negative representer points belong to different classes despite their striking similarity in appearance.
We demonstrate the usefulness of our representer points via two use cases: understanding why the model misclassified certain instances, and debugging mislabeled training data. We then wrap up with a discussion of the computational cost associated with our approach.
Why did the model mis-classify certain instances?
We want to use our class of explanations to understand the mistakes made by the model. With a Resnet-50 model trained on the Animals with Attributes (AwA) dataset (Xian et al. 2018), we pick test points with the ground-truth label “Antelope,” and analyze why the model made mistakes on some of these test points. Among 181 test instances labeled “Antelope”, 166 were classified correctly by the model, and 15 were misclassified. Among those 15, 12 were specifically misclassified as “Deer”, just as in the image shown below.
We computed representer points for all 12 of these misclassified test instances, and identified the top negative representer points for the class “Antelope.” Recall from the previous section that the top negative representer points are training points that inhibit the network from predicting “Antelope”, which can be used to make sense of why such inhibition occurred. For all 12 instances, the four representer points shown in the above figure (bottom row) were included among the top 5 negative representer points. Notice that these negative images do contain antelopes but have dataset labels belonging to different classes, like zebra or elephant. When the model is trained on these data points, the label forces the model to focus on just the elephant or zebra and ignore the antelope coexisting in the image. The model thus learns to inhibit the “Antelope” class given an image with small antelopes and other large objects. Hence, the representer points can point back to the errors in the training data that affected the model’s test-time prediction value.
Given a training dataset with corrupted labels, can we correct the dataset? And can we achieve better test accuracy with the corrected dataset?
We consider a scenario where humans need to inspect the dataset quality to ensure an improvement of the model’s performance on the test data. Real-world data is bound to be noisy, and the bigger the dataset becomes, the more difficult it will be for humans to look for and fix mislabeled data points. Consequently, it is crucial to know which data points are more important than others to the model so that we can prioritize data points to inspect and facilitate the debugging process.
We run a simulated experiment where we check a fraction of the training data according to the order set by different importance scores, flip their labels, and retrain the model using the modified training data to observe the improvement of the test accuracy. We also evaluate how quickly different methods can recover and correct wrongly labeled data.
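A minimal version of this simulation can be sketched with a linear model under squared loss, where the global sample importance \(\alpha_i\) is proportional to the residual on each training point, so mislabeled points (which have large residuals) receive large \(|\alpha_i|\). The data and model here are our own toy stand-ins, not the CIFAR10 setup from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 2, 0.1
# Two well-separated Gaussian blobs with labels -1 / +1.
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, d)),
               rng.normal(+2.0, 1.0, size=(n // 2, d))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
# Corrupt 10% of the labels.
flipped = rng.choice(n, size=n // 10, replace=False)
y_noisy = y.copy()
y_noisy[flipped] *= -1
# Ridge fit on noisy labels; alpha_i = -(theta . x_i - y_i) / (lam * n).
theta = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y_noisy / n)
alpha = -(X @ theta - y_noisy) / (lam * n)
# Inspect points in decreasing order of |alpha_i|: corrupted labels surface first.
order = np.argsort(-np.abs(alpha))
clean = np.setdiff1d(np.arange(n), flipped)
# Corrupted points carry much larger importance on average.
assert np.abs(alpha)[flipped].mean() > np.abs(alpha)[clean].mean()
```

In the actual experiment, one walks down `order`, fixes the labels found to be wrong, and retrains to observe the improvement in test accuracy.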
We used a logistic regression model for a binary classification task on the classes automobile vs horse from the CIFAR10 dataset. We used three methods to compute the importance values.
Our method recovers the test accuracy most quickly, and is comparable to influence functions at identifying and correcting the mislabeled data points.
All this is great, but can you compute these explanations quickly?
One advantage of our representer theorem is that it explicitly deconstructs a given deep neural network prediction in terms of representer values, so we are able to achieve an orders-of-magnitude speedup compared to influence functions (even including the fine-tuning step our method requires, in which we search for a stationary point, and which influence functions do not need). Below, we show the time in seconds for both methods to explain one test instance on two different datasets.
For more details on some theoretical aspects, as well as some additional experiments, please refer to the paper. We also encourage interested readers to try out our code on Github.
Nowadays most machine learning (ML) models predict labels from features. In classification tasks, an ML model predicts a categorical value, and in regression tasks, it predicts a real value. These models thus require a large number of feature-label pairs for training. While in practice it is not hard to obtain features, it is often costly to obtain labels because doing so requires human labor.
Can we do more? Can we learn a model without too many feature-label pairs? Think of human learning: as humans, we do not need 1,000 cat images labeled “cat” to learn what a cat is, or to differentiate cats from dogs. We can also learn the concept through comparisons. When we see a cat or dog, we can compare it with cats we have seen to decide whether we should label it “cat”.
Our recent papers (1,2) focus on using comparisons to build ML models. The idea of using comparisons is based on a classical psychological observation: it is easier for people to compare two items than to evaluate each item alone. For example, what is the age of the man in the image?
Not very easy, right? Is he 20, 30 or 40? We can probably say he is not very old, but it is just hard to be very accurate on the exact age. Now, which person in the two images is older?
Now based on the wrinkles and silver hair, you can probably quickly judge that the second man is older.
This phenomenon is not only present for this task, but also in many other real-world applications. For example, to diagnose patients, it is usually more difficult to directly label each patient with a kind of disease by experimental tests, but easier to compare the physical conditions of two patients. In material synthesis, measuring the characteristics of a material usually requires expensive tests, but comparisons are relatively easy through simulations. For movie ratings, it is often hard for us to give scores for a specific movie, but easier to pick our favorite among a list of movies.
So how can we build ML models using comparisons? Here we describe an approach that uses comparisons to infer labels for the unlabeled samples and feeds the inferred labels into existing models. Below we will look at two ways of performing such inference, for classification and regression respectively.
As described above, our setup starts with a set of unlabeled features \(x_1, x_2,…, x_n\), drawn independently and identically distributed (i.i.d.) from a feature distribution \(X\sim P_X\). Let the data dimension be \(d\). Our goal is to learn a function \(f: \mathbb{R}^d \rightarrow \mathcal{Y}\), where \(\mathcal{Y}\) is the label space. For example, for binary classification \(\mathcal{Y}=\{1, -1\}\), and for regression \(\mathcal{Y}=\mathbb{R}\).
We assume we can query either direct labels or pairwise comparisons. The direct label \(Y(x)\) is a (possibly noisy) version of \(f(x)\). The comparison \(Z\) is based on a pair of samples \(x,x'\) and indicates which one of \(x,x'\) is likely to have a larger \(f\) value. For binary classification, this means \(Z\) indicates the more positive sample; for regression, \(Z\) indicates the larger target (e.g., the older person of the pair of images). Our goal is to use as few direct label queries as possible.
Our high-level strategy is to obtain a fully labeled sample pool \(\hat{y}_1,…,\hat{y}_n\), where \(\hat{y}_i\) are either inferred or directly labeled, to feed into a supervised learning algorithm. We will show how such inference can happen, and how the querying process can neatly combine with the learning algorithm for a better performance.
Before we go to the algorithms, we first introduce our workhorse: Ranking from pairwise comparisons. We organize the comparisons to induce a ranking over all the samples. After that, we can do efficient inference with a very small amount of direct labels.
There is a vast literature on ranking from pairwise comparisons, based on different assumptions on the comparison matrix and desired properties. If we have perfect and consistent comparisons, we can use Quicksort (or Heapsort, insertion sort) to rank all \(n\) samples with \(O(n\log n)\) comparisons. If comparisons are noisy and inconsistent, things become more complicated, but we can still obtain meaningful rankings. We will not go into more detail about ranking since it is out of the scope of this post; we refer interested readers to this survey for more papers on the topic.
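With a perfect comparison oracle, any \(O(n\log n)\) comparison sort produces the ranking; in Python one can wrap the oracle with `functools.cmp_to_key`. The oracle below is a hypothetical stand-in for the real comparison queries:

```python
from functools import cmp_to_key

def rank_items(items, compare):
    """Sort items using only pairwise comparisons.

    `compare(a, b)` returns -1 if a has the smaller f-value, +1 if larger,
    and 0 on a tie; O(n log n) comparisons suffice when answers are consistent.
    """
    return sorted(items, key=cmp_to_key(compare))

# Toy oracle: compares hidden scores attached to each item.
hidden = {"a": 0.3, "b": 0.9, "c": 0.1}
oracle = lambda u, v: (hidden[u] > hidden[v]) - (hidden[u] < hidden[v])
print(rank_items(["a", "b", "c"], oracle))  # ['c', 'a', 'b']
```

With noisy comparisons one would substitute a robust ranking procedure here, but the downstream inference steps stay the same.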
Now let’s suppose we have a ranking over all items. We denote it as \(x_1\prec x_2\prec \cdots\prec x_n\), where \(x_i\prec x_j\) means we think \(f(x_i)\leq f(x_j)\). Note that the actual ranking induced by \(f\) might be different from \(x_1\prec x_2\prec \cdots\prec x_n\), as we can have errors in our comparisons.
Now we consider the binary classification problem. If we have a perfect ranking with \begin{align*}f(x_1)\leq f(x_2)\leq \cdots\leq f(x_n),\end{align*} this means the first few samples have labels -1, and then the remaining samples have label +1. Given this specific structure, we would want to find the changing point between negative and positive samples. How are we going to find it?
Binary search! Since the sequence is sorted, we need only \(\log n\) direct label queries to find the changing point. Note that this has a specific meaning in the context of classification: in the standard supervised learning setting, we need at least \(d\) labels to learn a classifier in \(d\) dimensions. With the ranking information at hand, we only need to find a threshold in a sequence, which is equivalent to learning a classifier in one dimension. Since comparison queries are in general cheaper, our algorithm can save a lot of cost.
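The binary search over the ranked sequence can be sketched as follows; `query_label` is a hypothetical oracle returning the (noiseless, for this sketch) label of a sample:

```python
def find_threshold(ranked, query_label):
    """Find the index of the first +1 sample in a ranked sequence whose labels
    look like -1, ..., -1, +1, ..., +1, using O(log n) label queries.

    Returns len(ranked) if every sample is labeled -1.
    """
    lo, hi = 0, len(ranked)          # invariant: first +1 lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(ranked[mid]) == -1:
            lo = mid + 1             # changing point is strictly to the right
        else:
            hi = mid                 # mid is +1, so first +1 is at or before mid
    return lo

labels = [-1, -1, -1, 1, 1]          # toy ground truth, in ranked order
print(find_threshold(range(5), lambda i: labels[i]))  # 3
```

Everything before the returned index is classified \(-1\), and everything at or after it \(+1\), which is exactly the one-dimensional classifier described above.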
There are a few more things to note for classification. The first is about ties: suppose our task is to differentiate between cats and dogs. If we are given two cat images, it does not really matter how we rank them, since we only care about the threshold between positive and negative samples.
Secondly, we can combine our algorithm with active learning to save even more label cost. Many active learning algorithms ask about a batch of samples in each round, and we can use our binary search to label each batch. In more detail, we show in our paper the following theorem:
Theorem (Informal). Suppose each label is correct with probability \(1/2+c\), for a constant \(c\). Then an active learning algorithm using only labels would require \(\Omega(d\log(1/\varepsilon))\) direct labels to achieve an error rate of \(\varepsilon\). On the other hand, using binary search on the ranking requires only \(O(\log(1/\varepsilon))\) direct labels, and \(O(d\log(d/\varepsilon))\) comparisons.
If we are doing regression, we cannot hope to find a threshold in the ranking, since we need to predict a real number for each label. However, ranking can still help regression, through isotonic regression. Given a ranked sequence \begin{align*} f(x_1)\leq f(x_2)\leq\cdots \leq f(x_n) \quad\text{and}\quad y_i=f(x_i)+\varepsilon_i,\ \varepsilon_i\sim \mathcal{N}(0,1),\end{align*} isotonic regression aims to find the solution of
\begin{align*}
\min_{\hat{y}_1,\ldots,\hat{y}_n}\ & \sum_{i=1}^n (\hat{y}_i-y_i)^2\\
\text{s.t. } & \hat{y}_i\leq \hat{y}_{i+1}, \quad \forall i=1,2,\ldots,n-1.
\end{align*}
If we used the noisy \(y_i\) directly as our labels, the mean squared error \(\frac{1}{n}\sum_{i=1}^n (y_i-f(x_i))^2\) would have an expectation of 1, since \(\varepsilon_i\sim \mathcal{N}(0,1)\). Isotonic regression with \(m\) labeled points instead enjoys an \(m^{-2/3}\) statistical rate, which vanishes as \(m\rightarrow \infty\). For a reference, see (Zhang, 2002).
The \(m^{-2/3}\) rate decays faster than the optimal rates of many non-parametric regression problems because it is dimension-independent. Non-parametric methods typically have an error rate of \(m^{-\frac{2}{d+2}}\) given \(m\) labels, the so-called curse of dimensionality (see Tsybakov’s book for an introduction to non-parametric regression). Since the rate of isotonic regression decays much faster, we only need a fraction of the labels for good accuracy. We leverage this property to design the following algorithm: suppose we directly query only \(m\) labels. Given a ranking over \(n\) points, we can infer the labels of the unlabeled samples from their nearest labeled points. That is, we query \(y_{t_1},\ldots,y_{t_m}\), obtain refined values \(\hat{y}_{t_1},\ldots,\hat{y}_{t_m}\) from the isotonic regression formulation above, and then label each point \(i\in\{1,\ldots,n\}\) as \(\hat{y}_i=\hat{y}_{t_j}\), where \(t_j\) is \(i\)’s nearest neighbor in \(\{t_1,\ldots,t_m\}\).
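That procedure can be sketched compactly: pool-adjacent-violators (PAVA) on the \(m\) queried labels, followed by nearest-neighbor spreading along the ranking. The function names are our own illustration, not code from the paper:

```python
def pava(y):
    """Pool Adjacent Violators: least-squares fit of a nondecreasing sequence."""
    blocks = []                       # each block: [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Merge backwards while consecutive block means decrease.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def spread_labels(n, queried_ranks, y_queried):
    """Fit isotonic values on the m queried ranks, then copy each fitted value
    to the unlabeled ranks nearest to it."""
    y_hat = pava(y_queried)
    return [y_hat[min(range(len(queried_ranks)),
                      key=lambda j: abs(queried_ranks[j] - i))]
            for i in range(n)]

print(pava([3.0, 1.0, 2.0]))                 # [2.0, 2.0, 2.0]
print(spread_labels(5, [0, 4], [1.0, 3.0]))  # [1.0, 1.0, 1.0, 3.0, 3.0]
```

Only the \(m\) queried ranks ever touch a human labeler; the remaining \(n-m\) points inherit their values through the ranking alone.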
In our paper, we analyze this algorithm under the non-parametric regression setting. We have the following theorem:
Theorem (Informal). Suppose the underlying function \(f\) is Lipschitz. If we use \(m\) direct labels, any algorithm will incur an error of at least \(\Omega\left(m^{-\frac{2}{d+2}}\right)\). If we use isotonic regression with nearest neighbors, the error will be \(m^{-\frac{2}{3}}+n^{-\frac{2}{d}}\), where \(m\) is the number of direct labels, and \(n\) is the number of ranked points. This rate is optimal for any algorithm using \(m\) direct labels and \(n\) ranked points.
Note that the MSE of non-parametric regression using only the labeled samples is \(\Theta(m^{-\frac{2}{d+2}})\); reaching a fixed error thus requires a number of labels exponential in \(d\), which makes non-parametric regression impractical in high dimensions. Focusing on the dependence on \(m\), our result improves the rate to \(m^{-2/3}\), whose exponent no longer depends on \(d\). Therefore, using the ranking information we can avoid the curse of dimensionality.
Now let’s test our algorithm in practice. Our task is to predict the ages of people in images, as aforementioned. We use the APPA-REAL dataset, with 7,113 images and associated ages. The dataset is suitable for comparisons because it contains both the biological age, as well as the apparent age estimated from human labelers. Suppose our goal is to predict the biological age, and we can simulate comparisons by comparing the apparent ages.
Our classification task is to judge whether a person is under or over 30 years old. We compare our method with a baseline active learning method that uses only label queries. Both methods use a linear SVM classifier (features are extracted from the 128-dimensional top layer of FaceNet, an unsupervised method to extract features from faces). The shaded regions represent the standard deviation over 20 repetitions of the experiment. The plots show that comparisons indeed reduce the number of label queries.
Our regression task is to predict the actual ages, and we compute the mean squared error (MSE) to evaluate different methods. Our label-only baselines are nearest-neighbor (NN) methods with 5 or 10 neighbors (5-NN and 10-NN), and support vector regression (SVR). Our methods use 5-NN or 10-NN after we have inferred the labels via isotonic regression; we thus name them R\(^2\) 5-NN and R\(^2\) 10-NN. Again, the experiments show that comparisons can reduce the number of label queries.
Of course, binary classification and regression are not the only settings where using comparison information can have a big impact. Using the rank-and-infer approach, we hope to extend these results to multi-class classification, optimization, and reinforcement learning. Feel free to get in touch if you want to learn more!