Over the past decade, artificial intelligence (AI) has achieved remarkable success in many fields such as healthcare, automotive, and marketing. The capabilities of sophisticated, autonomous decision systems driven by AI keep evolving and moving from lab to reality. Many of these systems are **black-box**, which means we don’t really understand how they work or why they reach the decisions they do.

As black-box decision systems have come into greater use, they have also come under greater criticism. One of the main concerns is that it is dangerous to rely on black-box decisions without knowing the way they are made. Here is an example of why they can be dangerous.

Risk-assessment tools have been widely used in the federal and state courts to facilitate and improve judges’ decisions in the criminal justice processes. They provide defendants’ future criminal risk based on socio-economic status, family background, and other factors. In May 2016, ProPublica claimed that one of the most widely used risk-assessment tools, COMPAS, was biased against black defendants while being more generous to white defendants [link]. Northpointe, a for-profit company that provides the software, disputed the analysis but refused to disclose the software’s decision mechanism. So it is not possible for either stakeholders or the public to see what might be actually creating the disparity.

How can we resolve this concern? Explaining how a black-box decision system works, or why it reaches a particular decision, helps us decide whether or not to follow that decision. The need for interpretability is especially urgent in fields where black-box decisions can be life-changing and have significant consequences, such as disease diagnosis, criminal justice, and self-driving cars.

What makes a ‘good’ explanation for a black-box? Suppose you give a black-box predictive model an image of an apple, then open the black-box and explain why it believes the image indeed shows an apple. Simply saying that *“it is red, so this is an apple”* is not sufficient to justify the prediction, but you should also avoid redundant explanation. It is important to give enough information **concisely** when explaining a black-box decision system. In other words, explanations should be **brief but comprehensive**.

How can we take into account both briefness and comprehensiveness for explaining a black-box? Our work uses an information theoretic perspective to quantify the idea of briefness and comprehensiveness.

The information bottleneck principle (Tishby et al., 2000) provides an appealing information theoretic view for learning supervised models by defining what we mean by a **‘good’ representation**. The principle says that the optimal model transmits as much information as possible from its input to its output through a compressed representation called the information bottleneck. And the information bottleneck is a good representation that is **maximally informative about the output while compressive about a given input**. Recently, Shwartz-Ziv et al. (2017) and Tishby et al. (2015) showed that the principle also applies to deep neural networks and each layer of the networks can work as an information bottleneck.

We adopt the information bottleneck principle as a criterion for finding a **‘good’ explanation**. In the information theoretic view, we define a brief but comprehensive explanation as **maximally informative about the black-box decision while compressive about a given input**. In other words, the explanation should retain as little mutual information with the input as possible while preserving as much mutual information with the black-box output as possible.

We introduce the *variational information bottleneck for interpretation (VIBI)*, a **system-agnostic** information bottleneck model that provides a **brief but comprehensive** explanation for every single decision made by a black-box.

VIBI is composed of two parts: an **explainer** and an **approximator**, each of which is modeled by a deep neural network. Using the information bottleneck principle, VIBI learns an explainer that favors brief explanations while enforcing that the explanations alone suffice for an accurate approximation to the black-box. The following illustration gives an overview of VIBI.

For each instance, the explainer returns the probability that a chunk of features, called a *cognitive chunk*, will be selected as an explanation. A cognitive chunk is defined as a group of raw features that work as a unit to be explained and whose identity is recognizable to a human, such as a word, phrase, sentence or a group of pixels. The selected chunks act as an information bottleneck that is maximally compressed about the input and informative about the decision made by a black-box system on that input.

Now, we formulate the following optimization problem inspired by the information bottleneck principle to learn the explainer and approximator:

$$ p(\mathbf{z} | \mathbf{x}) = \mathrm{argmax}_{p(\mathbf{z} | \mathbf{x}), p(\mathbf{y} | \mathbf{t})} ~~\mathrm{I} ( \mathbf{t}, \mathbf{y} ) - \beta~\mathrm{I} ( \mathbf{x}, \mathbf{t} )$$

where \( \mathrm{I} ( \mathbf{t}, \mathbf{y} ) \) represents the sufficiency of information retained for explaining the black-box output \( \mathbf{y} \), \(-\mathrm{I} ( \mathbf{x}, \mathbf{t} ) \) represents the briefness of the explanation \( \mathbf{t} \), and \( \beta \) is a Lagrange multiplier representing a trade-off between the two.
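To make the two terms of the objective concrete, here is a minimal sketch that evaluates \( \mathrm{I}(\mathbf{t}, \mathbf{y}) - \beta\,\mathrm{I}(\mathbf{x}, \mathbf{t}) \) on small discrete joint distributions. The helper names and the toy probability tables are assumptions made for illustration only; VIBI itself works with neural networks and does not enumerate distributions this way.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a 2-D list."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (row_marginal[i] * col_marginal[j]))
    return mi

def ib_objective(joint_ty, joint_xt, beta):
    """Sufficiency term minus beta times the compression penalty."""
    return mutual_information(joint_ty) - beta * mutual_information(joint_xt)

# Hypothetical toy tables: rows index the first variable, columns the second.
joint_ty = [[0.45, 0.05],
            [0.05, 0.45]]   # explanation t is fairly predictive of y
joint_xt = [[0.25, 0.0],
            [0.25, 0.0],
            [0.0, 0.25],
            [0.0, 0.25]]    # four inputs collapsed onto two explanations

for beta in (0.0, 0.1, 1.0):
    print(beta, ib_objective(joint_ty, joint_xt, beta))
```

Increasing `beta` in this toy setup penalizes explanations that retain more information about the input, which is the trade-off the Lagrange multiplier controls.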

The information bottleneck objective in its current form is intractable because of the mutual information terms and the non-differentiable sample \( \mathbf{z} \). We address these challenges as follows.

**Variational Approximation to Information Bottleneck Objective**

The mutual information terms \( \mathrm{I} ( \mathbf{t}, \mathbf{y} ) \) and \( \mathrm{I} ( \mathbf{x}, \mathbf{t} ) \) are computationally expensive to quantify (Tishby et al., 2000; Chechik et al., 2005). In order to reduce the computational burden, we use a variational approximation to our information bottleneck objective:

$$\mathrm{I} ( \mathbf{t}, \mathbf{y} )~-~\beta~\mathrm{I} ( \mathbf{x}, \mathbf{t} ) \geq \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} \mathbb{E}_{\mathbf{y} | \mathbf{x} \sim p(\mathbf{y} | \mathbf{x})} \mathbb{E}_{\mathbf{t} | \mathbf{x} \sim p(\mathbf{t} | \mathbf{x})} \left[ \log q(\mathbf{y} | \mathbf{t}) \right] ~-~\beta~\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})} \mathrm{KL} (p(\mathbf{z}| \mathbf{x}), r(\mathbf{z})) $$

Now, we can integrate the Kullback-Leibler divergence \( \mathrm{KL} (p(\mathbf{z}| \mathbf{x}), r(\mathbf{z})) \) analytically with proper choices of \( r(\mathbf{z}) \) and \( p(\mathbf{z}|\mathbf{x}) \). We also use the empirical data distribution to approximate \( p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})p(\mathbf{y}|\mathbf{x}) \).
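As one illustration of a KL term that can be evaluated in closed form, the sketch below assumes that both \( p(\mathbf{z}|\mathbf{x}) \) and \( r(\mathbf{z}) \) are categorical distributions over the same set of chunks. That choice, the function names, and the numbers are assumptions for this example; the specific distributions used in VIBI are described in the paper.

```python
import math

def categorical_kl(p, r):
    """KL(p || r) in nats for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def average_kl(posteriors, prior):
    """Empirical average of KL(p(z|x) || r(z)) over a batch of inputs x."""
    return sum(categorical_kl(p, prior) for p in posteriors) / len(posteriors)

# Hypothetical selection probabilities over four chunks for three inputs.
posteriors = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
]
uniform_prior = [0.25, 0.25, 0.25, 0.25]

print(average_kl(posteriors, uniform_prior))
```

With a uniform prior, an input whose selection probabilities are already uniform contributes zero to the penalty, while sharply peaked selections contribute more.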

**Continuous Relaxation and Re-parameterization**

We use the generalized Gumbel-softmax trick (Jang et al., 2017; Chen et al., 2018), which approximates the non-differentiable categorical subset sampling with Gumbel-softmax samples that are differentiable. This trick allows using standard backpropagation to compute the gradients of the parameters via reparameterization.
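The relaxation can be sketched in plain Python as follows. A single Gumbel-softmax sample perturbs the log-probabilities with Gumbel noise and passes them through a temperature-scaled softmax. To stand in for selecting \( k \) chunks, this sketch takes the element-wise maximum of \( k \) independent samples, which is our reading of the construction in Chen et al. (2018) and should be treated as an assumption. The function names are hypothetical, and a real implementation would use an automatic differentiation library so that gradients can flow through the samples.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one standard Gumbel variate via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def gumbel_softmax_sample(log_probs, temperature, rng):
    """Relaxed one-hot sample: softmax((log_probs + Gumbel noise) / temperature)."""
    perturbed = [(lp + sample_gumbel(rng)) / temperature for lp in log_probs]
    peak = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_k_hot(log_probs, k, temperature, rng):
    """Element-wise maximum of k relaxed one-hot samples (assumed construction)."""
    samples = [gumbel_softmax_sample(log_probs, temperature, rng) for _ in range(k)]
    return [max(column) for column in zip(*samples)]

rng = random.Random(0)
chunk_log_probs = [math.log(p) for p in (0.5, 0.3, 0.1, 0.1)]  # hypothetical explainer output
print(relaxed_k_hot(chunk_log_probs, k=2, temperature=0.5, rng=rng))
```

Lower temperatures push each sample closer to a hard one-hot vector, at the cost of noisier gradients.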

VIBI provides instance-specific keywords to explain an LSTM sentiment prediction model using Large Movie Review Dataset, IMDB.

Keywords such as “waste” and “horrible” are selected for the negative-predicted movie review, while keywords such as “most fascinating” explain the model’s positive-predicted movie review. We can also see that the LSTM sentiment prediction model makes a wrong prediction for a negative review because the review includes several positive words such as “enjoyable” and “exciting”.

VIBI also provides instance-specific key patches containing \( 4 \times 4 \) pixels to explain a CNN digit recognition model using the MNIST image dataset.

The first two examples show that the CNN recognizes digits using both shapes and angles. In the first example, the CNN characterizes ‘1’s by straightly aligned patches along with the activated regions although ‘1’s in the left and right panels are written at different angles. Contrary to the first example, the second example shows that the CNN recognizes the difference between ‘9’ and ‘6’ by their differences in angles. The last two examples show that the CNN catches a difference of ‘7’s from ‘1’s by patches located on the activated horizontal line on ‘7’ (see the cyan circle) and recognizes ‘8’s by two patches on the top of the digits and another two patches at the bottom circle.

We assume that a better explanation allows humans to better infer the black-box output given the explanation. Therefore, we asked humans to infer the output of the black-box system (Positive/Negative/Neutral) given five keywords as an explanation generated by VIBI and other competing methods (Saliency, LIME, and L2X). Each method was evaluated by workers on Amazon Mechanical Turk who hold the Masters Qualification (i.e. high-performance workers who have demonstrated excellence across a wide range of tasks). We also evaluated the interpretability for the CNN digit recognition model using MNIST. We asked humans to directly score the explanation on a 0 to 5 scale (0 for no explanation, 1-4 for insufficient or redundant explanation, and 5 for concise explanation). Each method was evaluated by 16 graduate students at the School of Computer Science, Carnegie Mellon University, who have taken at least one graduate-level machine learning class.

We assessed the fidelity of the approximator by its prediction performance with respect to the black-box output. We introduce two formalized metrics to quantitatively evaluate the fidelity: *approximator fidelity* and *rationale fidelity*.

Approximator fidelity measures the ability of the approximator to imitate the behaviour of a black-box. As shown above, VIBI and L2X outperform the others in approximating the black-box models. However, this does not mean that the two approximators are equal in fidelity. See below.

Rationale fidelity measures how much the selected chunks contribute to the approximator fidelity. As shown above, the selected chunks of VIBI account for more approximator fidelity than those of L2X. Note that L2X is a special case of VIBI with the information bottleneck trade-off parameter \( \beta = 0 \) (i.e. not using the compressiveness term \( -\mathrm{I} ( \mathbf{x}, \mathbf{t} ) \)). Therefore, compressing information through the explainer achieves not only conciseness of explanation but also better fidelity of explanation to a black-box.

Note that the number of cognitive chunks to be selected, \( k \), should be given in advance. It also impacts conciseness of the actual total explanation and should be chosen carefully. In our analysis, we choose \( k \) as the minimum number that exceeds a certain fidelity.
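The selection rule for \( k \) can be sketched as a small helper. The function name, the fidelity table, and the threshold below are hypothetical placeholders; in practice the fidelity for each candidate \( k \) would come from training and evaluating VIBI with that \( k \).

```python
def choose_min_k(fidelity_by_k, threshold):
    """Return the smallest k whose fidelity exceeds the threshold, or None."""
    for k in sorted(fidelity_by_k):
        if fidelity_by_k[k] > threshold:
            return k
    return None

# Hypothetical fidelity values measured for several candidate k.
fidelity_by_k = {1: 0.61, 5: 0.78, 10: 0.86, 15: 0.88}
print(choose_min_k(fidelity_by_k, threshold=0.75))  # 5
```

Picking the smallest qualifying `k` keeps the explanation as short as the fidelity requirement allows.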

Further details can be found here. The code is publicly available here.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of CMU.

Consider the following problem: we are given a set of items, and the goal is to pick the “best” ones from them. This problem appears very often in real life: for example, selecting papers in conference peer review, judging the winners of a diving competition, picking city construction proposals to allocate funds, etc. In these examples, a common procedure is to assign the items (papers/contestants/proposals) to people (reviewers/judges/citizens) and ask them for their opinion. Then, we aggregate their opinions, and select the best items accordingly. For simplicity, we assume that each item has a “true” quality (or “true” value), which is a real number that precisely quantifies how good the item is, and this number is unknown to us. The best items are then the ones with the highest true qualities.

There are a number of sources of biases that may arise when soliciting evaluations from people. In this blog post, we focus on *miscalibration*, which refers to people using different scales when assigning numerical scores. As a running example throughout this blog post, we consider conference peer review. Peer review is a process common in scientific publication. When researchers submit a paper to a conference, the conference organizers will assign the paper to a few “peer reviewers”, who are researchers in the same field, and ask these reviewers to evaluate the quality of the paper. Based on the reviews and comments written by the peer reviewers, the conference organizers make a decision on whether to accept or reject the paper.

It might be the case that some reviewers are lenient and always provide scores in the range [6, 10] whereas some reviewers are more stringent and provide scores in the range [0, 4]. Or it might be the case that one reviewer is moderate whereas the other is extreme — the first reviewer’s 2 is equivalent to the second reviewer’s 1 whereas the first reviewer’s 3 is equivalent to the second reviewer’s 9. Indeed, the issue of miscalibration has been widely noted in the literature:

[Ammar & Shah, 2012]

“The rating scale as well as the individual ratings are often arbitrary and may not be consistent from one user to another.”

[Mitliagkas et al. 2011]

“A raw rating of 7 out of 10 in the absence of any other information is potentially useless.”

So what should we do with the miscalibrated scores we receive? There are two common approaches to address miscalibration. One approach is to make simple assumptions about the nature of miscalibration. For example, in the past, people have assumed that miscalibration is linear. That is, when a reviewer reviews a paper, the score reported by this reviewer will be the true quality of the paper multiplied by a positive scalar, followed by an addition or subtraction of another scalar. However, calibration issues with human-provided scores are often much more complex, and therefore we have not seen much success with these simple models in real conference peer review settings.

The second approach is to use only the ranking of the items. We use “ranking” to refer to the ordering of items. For example, if a reviewer gives scores of 5, 9, and 3 to three papers respectively, then the “ranking” from this reviewer is that the second paper is better than the first paper, and the first paper is better than the third. The ranking can be obtained by sorting the reviewer’s scores of papers (for simplicity, assume there are no ties), or by directly asking the reviewers to rank the papers. In practice, rankings are often used instead of numerical scores. For example, quoting the landmark paper by Freund et al.:

[Freund et al. 2003]

“[Using rankings instead of ratings] becomes very important when we combine the rankings of many viewers who often use completely different ranges of scores to express identical preferences.”
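The conversion from scores to a ranking described above can be sketched in a few lines. The function name is ours, papers are indexed from 0, and we assume no ties, as in the text.

```python
def ranking_from_scores(scores):
    """Return paper indices (0-based) ordered from best to worst, assuming no ties."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# The example from the text: scores of 5, 9, and 3 for three papers.
print(ranking_from_scores([5, 9, 3]))  # [1, 0, 2]

# A reviewer using a very different scale can express the same preferences.
print(ranking_from_scores([1.0, 1.5, 0.2]))  # [1, 0, 2]
```

Both reviewers induce the same ranking even though their raw scores are not comparable, which is exactly the information a ranking-based method keeps and the magnitude information it throws away.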

It is a folklore belief that without making any simplifying assumptions on miscalibration, the only useful information is the underlying ranking. In our AAMAS 2019 paper, we examine the fundamental question of whether this folklore belief is true. Concretely, we present theoretical results that contest this belief. We show that, if we use the rating data instead of only the ranking data, we can do strictly better in selecting the best items, even amidst high levels of miscalibration.

For simplicity, let’s first consider the following toy problem: given two papers and two reviewers, we want to select the better paper out of the two. Suppose each reviewer is assigned one paper, and this assignment is done uniformly at random. The two reviewers provide their evaluations (e.g., on a scale from 0 to 10) for the respective paper they review. The reviewers’ rating scales may be miscalibrated. This miscalibration can be arbitrary, and is unknown to us. Since each reviewer only provides a single score, the ranking data collected from each reviewer is vacuous. An algorithm based on rankings can’t really do better than randomly guessing which paper is better. The question we aim to answer is thus: in such a case, can we do strictly better than a random guess, by making use of the scores given by the two reviewers, instead of just their rankings?

Interestingly, as we will explain shortly, the answer turns out to be “yes”. This contests the folklore belief that under arbitrary miscalibration, the only useful information in ratings is the underlying ranking.

To understand the general problem of miscalibration, we first consider a simplified setting, and the key ideas from this setting will be used as a crucial building block for more general algorithms. In this simplified setting, assume that we have two papers with unknown quality values \(x_1, x_2\in \mathbb{R}\), and two reviewers. The two papers are respectively assigned to the two reviewers uniformly at random. That is, paper 1 is assigned to reviewer 1 and paper 2 to reviewer 2 with probability 0.5 (otherwise, paper 1 is assigned to reviewer 2 and paper 2 to reviewer 1). For each reviewer \(i \in \{1, 2\}\), we use a “calibration function” \(f_i: \mathbb{R} \rightarrow \mathbb{R}\) to represent the miscalibration of that reviewer. This function is a mapping from the true quality of a paper to the score that the reviewer will report for this paper. That is, if the true value of a paper evaluated by reviewer \(i\) is \(x\), then the reviewer will report \(f_i(x)\). For convenience of exposition, we normalize the rating scale such that the ratings lie in the range of [0, 1], so we have \(f_i : \mathbb{R} \rightarrow [0, 1]\).

We assume that the calibration functions \(f_1\) and \(f_2\) are strictly monotonically increasing. That is, if a reviewer were assigned a paper of higher quality, then the reviewer would give that paper a higher score than one of lower quality (but we don’t know by how much). Other than that, the values \(x_1, x_2\) and functions \(f_1, f_2\) can be arbitrary. Let us denote the reported score for paper 1 (from its assigned reviewer) as \(y_1\), and the reported score for paper 2 (from its assigned reviewer) as \(y_2\). Given the scores \(y_1, y_2\), and the assignment of which paper is assigned to which reviewer, our goal is to tell which paper is better (i.e., infer whether \(x_1 > x_2\) or \(x_1 < x_2\)).

At first, it may seem impossible to extract any useful information from the numerical scores, as the two papers are reviewed by different reviewers, and therefore the scores can be different due to either miscalibration, or differences in the true paper qualities. Say, reviewer 1 is assigned paper 1, and gives a score of 0.5, and reviewer 2 is assigned paper 2 and gives a score of 0.8. Then either of the following two cases is possible (among an infinite number of possible cases):

$$ \begin{array}{lll}
\text{Case I:} & x_1 = 0.5, & f_1(x) = x, \\
 & x_2 = 0.8, & f_2(x) = x; \\
\text{Case II:} & x_1 = 1.0, & f_1(x) = \frac{x}{2}, \\
 & x_2 = 0.8, & f_2(x) = x.
\end{array} $$

In Case I, we have \(x_1 < x_2\), and in Case II, we have \(x_1 > x_2\). If an algorithm outputs the outcome aligned with one case, then the algorithm will fail in the other case. Indeed, the following theorem shows that no *deterministic* algorithm based on ratings can ever be strictly better than random guessing.

**Theorem 1.** Given the scores \(y_1, y_2\) and the assignment, no deterministic algorithm can always perform strictly better than random guessing, under all possible \(x_1, x_2\) and strictly monotonic \(f_1, f_2\).

Let’s try to understand why this is the case. A deterministic algorithm “commits” to an action (deciding which paper has a better quality). It performs well if the situation is aligned with this action. However, due to its prior commitment it may fail if the situation is not aligned. To be more specific, consider the game of rock-paper-scissors. In this game, a deterministic algorithm always loses to an adversary (if the deterministic algorithm plays scissors, then the adversary wins by playing rock, etc.).
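The difficulty for a deterministic algorithm can be seen by evaluating Case I and Case II from the example above in code: both produce exactly the same observed scores, even though the better paper differs. The helper name below is ours.

```python
def observed_scores(x1, x2, f_paper1, f_paper2):
    """Scores reported for paper 1 and paper 2 by their assigned reviewers."""
    return f_paper1(x1), f_paper2(x2)

# Reviewer 1 reviews paper 1 and reviewer 2 reviews paper 2 in both cases.
case_one = observed_scores(0.5, 0.8, lambda x: x, lambda x: x)      # paper 2 is better
case_two = observed_scores(1.0, 0.8, lambda x: x / 2, lambda x: x)  # paper 1 is better

print(case_one)  # (0.5, 0.8)
print(case_two)  # (0.5, 0.8)
```

Any deterministic rule maps the observation `(0.5, 0.8)` to a single answer, so it is wrong in one of the two cases.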

The key to solving the problem is *randomization*: a randomized algorithm can judiciously balance out the good and bad cases. Going back to the example of rock-paper-scissors, consider a randomized algorithm that chooses one of the three actions (rock, paper or scissors) uniformly at random. It can be formally shown that this randomized algorithm wins 1/3 of the time against the adversary. Given this motivation, we consider the following randomized algorithm for our estimation problem:

**Algorithm.** Output the paper with the higher score, with probability \(\frac{1 + \lvert y_1-y_2\rvert}{2}\). Otherwise, output the paper with the lower score.
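A direct transcription of this rule into Python might look as follows. The function name and the optional `rng` argument are our own additions, and scores are assumed to be already normalized to [0, 1] so that the probability is valid.

```python
import random

def pick_better_paper(y1, y2, rng=random):
    """Return 1 or 2: the higher-scored paper with probability (1 + |y1 - y2|) / 2."""
    p_higher = (1 + abs(y1 - y2)) / 2
    higher, lower = (1, 2) if y1 >= y2 else (2, 1)
    return higher if rng.random() < p_higher else lower

rng = random.Random(0)
picks = [pick_better_paper(0.5, 0.8, rng) for _ in range(10)]
print(picks)  # mostly 2, since paper 2 is chosen with probability 0.65
```

When the two scores are equal, the rule reduces to a fair coin flip, and when they differ by the full range of the scale, the higher-scored paper is always chosen.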

We now show that our randomized algorithm can indeed achieve the desired goal.

**Theorem 2.** The proposed randomized algorithm succeeds with probability *strictly* greater than \(0.5\), for any \(x_1, x_2\) and strictly monotonic \(f_1, f_2\).
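As an informal sanity check of Theorem 2 (not a proof), the simulation below estimates the success rate on one instance. The qualities \(x_1 = 1.0\), \(x_2 = 0.8\) and the calibration functions \(f_1(x) = x/2\), \(f_2(x) = x\) are arbitrary choices for illustration, and the function name is ours.

```python
import random

def estimate_success_rate(x1, x2, f1, f2, trials, seed=0):
    """Monte Carlo estimate of how often the randomized algorithm picks the better paper."""
    rng = random.Random(seed)
    truly_better = 1 if x1 > x2 else 2
    successes = 0
    for _ in range(trials):
        # Assign papers to reviewers uniformly at random.
        if rng.random() < 0.5:
            y1, y2 = f1(x1), f2(x2)  # reviewer 1 gets paper 1
        else:
            y1, y2 = f2(x1), f1(x2)  # reviewer 2 gets paper 1
        # Pick the higher-scored paper with probability (1 + |y1 - y2|) / 2.
        p_higher = (1 + abs(y1 - y2)) / 2
        higher, lower = (1, 2) if y1 >= y2 else (2, 1)
        guess = higher if rng.random() < p_higher else lower
        successes += guess == truly_better
    return successes / trials

rate = estimate_success_rate(1.0, 0.8, lambda x: x / 2, lambda x: x, trials=100_000)
print(rate)
```

For this instance the two assignments succeed with probabilities 0.35 and 0.8, so the estimate should land near their average of 0.575, above the 0.5 achieved by random guessing.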

More generally, let \(g:\mathbb{R} \rightarrow [-1, 1]\) be any strictly monotonically increasing function that is anti-symmetric around 0 (that is, \(g(x) = -g(-x)\) for all \(x\in \mathbb{R}\)). Then, Theorem 2 holds for algorithms that output the paper with the higher score with probability \(\frac{1 + g(\lvert y_1 - y_2\rvert)}{2}\). The algorithm mentioned above is a special case using the identity function \(g(u) = u\) on the normalized scores. The function \(g\) can also take other forms, such as a sigmoid rescaled to be anti-symmetric around 0, \(g(u) = \frac{2}{1 + e^{-u}} - 1\).

Using the canonical \(2\times 2\) setting as a building block, we construct algorithms in more general settings such as A/B testing and ranking. See our paper for more details. The paper also includes a discussion on the inspirations and connections to the related work, including Stein’s shrinkage, empirical Bayes, and the two-envelope problem [Cover 1987].

The rest of this section is devoted to giving intuition about Theorem 2 along with a proof sketch.

The key intuition of this result is to exploit the monotonic structure of the calibration functions, whereas this structure is unavailable in ranking data. As we discussed, the randomized algorithm does not make a prior commitment, but instead spreads out its bets on both the good and the bad cases. In this case, because of the monotonic structure of the calibration functions, the probability of the good case (correct estimation) is greater than the probability of the bad case (incorrect estimation) for the randomized algorithm. We now provide a simple proof sketch.

**Proof sketch.** Without loss of generality, let us assume that \(x_1 < x_2\). Then we consider two cases:

**Case I:** The scores given by the two reviewers for paper 2 are strictly higher than the scores for paper 1. That is, \(\max\{f_1(x_1), f_2(x_1)\} < \min\{f_1(x_2), f_2(x_2)\}\).

With the random assignment, we observe either \(\{f_1(x_1), f_2(x_2)\}\) or \(\{f_2(x_1), f_1(x_2)\}\). In either assignment, we have \(y_2 > y_1\), and the proposed algorithm succeeds with probability \(\frac{1 + (y_2 - y_1)}{2} > \frac{1}{2}\).

**Case II:** In at least one of the assignments, the score for paper 2 is lower than or equal to the score for paper 1. Without loss of generality, assume \(f_1(x_1) \ge f_2(x_2)\). Then by the monotonicity of \(f_1, f_2\), we have $$ f_2(x_1) < f_2(x_2) \le f_1(x_1) < f_1(x_2)\qquad \qquad (\star)$$.

We illustrate Equation \((\star)\) pictorially as follows:

With the assignment, we either observe the two blue scores, or the two red scores. In the blue assignment, the algorithm is more likely to conclude that paper 1 is better (bad case). In the red assignment, the algorithm is more likely to conclude that paper 2 is better. The difference \(\lvert {y_1 -y_2} \rvert\) between the two scores is greater in the red assignment. By the construction of the algorithm, it leverages this difference, so that it “succeeds more” in the red assignment than the amount it “loses” in the blue assignment.

More formally, for assignment \(\{f_2(x_1), f_1(x_2)\}\), the algorithm succeeds with probability \(\frac{1 + (f_1(x_2) - f_2(x_1))}{2}\), and for assignment \(\{f_1(x_1), f_2(x_2)\}\), the algorithm succeeds with probability \(\frac{1 - (f_1(x_1) - f_2(x_2))}{2}\). Each assignment occurs with probability \(\frac{1}{2}\), so taking an expectation over the assignment, the overall probability of success is

$$ \frac{1}{2} + \frac{(f_1(x_2) - f_2(x_1)) - (f_1(x_1) - f_2(x_2))}{4} > \frac{1}{2}, $$

because \( f_1(x_2) - f_2(x_1) > f_1(x_1) - f_2(x_2)\) by Equation \((\star)\) (or by the Figure).

\(\square\)
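The Case II argument can be traced on a concrete instance. The sketch below checks the ordering in Equation \((\star)\) and then averages the success probabilities of the two assignments. The qualities and calibration functions are arbitrary illustrative choices, and the function name is ours.

```python
def case_two_success(x1, x2, f1, f2):
    """Per-assignment and overall success probabilities in Case II (x1 < x2)."""
    if not (x1 < x2 and f1(x1) >= f2(x2)):
        raise ValueError("inputs do not fall under Case II")
    # Ordering (star), which follows from the monotonicity of f1 and f2.
    if not (f2(x1) < f2(x2) <= f1(x1) < f1(x2)):
        raise ValueError("calibration functions are not strictly increasing here")
    p_red = (1 + (f1(x2) - f2(x1))) / 2   # assignment {f2(x1), f1(x2)}: the good case
    p_blue = (1 - (f1(x1) - f2(x2))) / 2  # assignment {f1(x1), f2(x2)}: the bad case
    overall = (p_red + p_blue) / 2        # each assignment has probability 1/2
    return p_red, p_blue, overall

p_red, p_blue, overall = case_two_success(0.8, 1.0, lambda x: x, lambda x: x / 2)
print(p_red, p_blue, overall)
```

Here the red assignment observes scores 0.4 and 1.0 while the blue assignment observes 0.8 and 0.5, so the gain in the good case (a gap of 0.6) outweighs the loss in the bad case (a gap of 0.3).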

The two key take-aways from our paper are:

(1) Numerical scores contain strictly more information than rankings, even in presence of arbitrary miscalibration. This is in contrast to the folklore belief that under arbitrary miscalibration, the only useful information in ratings is the underlying ranking.

(2) In conference peer review, paper decisions are typically made in a deterministic fashion. However, for papers near the acceptance border, the difference in their scores is small, and could very well be due to issues of calibration of reviewers rather than inherent qualities of the papers. Our work thus suggests that a fairer alternative is to randomize the paper decisions at the border, in the manner of our proposed algorithm, in order to account for miscalibration.

Our paper also gives rise to a number of open problems of interest:

(1) **Non-adversarial models:** In order to analyze the folklore belief, we consider arbitrary miscalibration, and give an algorithm based on ratings that uniformly outperforms algorithms based on rankings. From a practical point of view, it is of interest to model the nature of miscalibration that is not the worst case — something in between over-simplified models for miscalibration and arbitrary miscalibration.

(2) **Combining different sources of biases:** Miscalibration does not happen in isolation, and other factors also contribute to inaccuracies in paper decisions, such as subjectivity [Noothigattu et al. 2018], strategic behavior [Xu et al. 2018] and noise [Stelmakh et al. 2018]. For example, subjectivity means that people may hold different opinions about the merits of certain papers: what one reviewer thinks is a good paper may look like a mediocre paper to another reviewer, and therefore the paper receives different scores from the two reviewers. Miscalibration, in contrast, means that even if a paper appears identically good to two reviewers, the reviewers may still give different scores. Combining miscalibration simultaneously with these other factors is a useful and challenging open problem.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.

A. Ammar and D. Shah. “Efficient rank aggregation using partial data”. SIGMETRICS 2012.

T. Cover. “Pick the Largest Number”. 1987.

Y. Freund, R. Iyer, R. E. Schapire and Y. Singer. “An Efficient Boosting Algorithm for Combining Preferences”. Journal of Machine Learning Research 2003.

I. Mitliagkas, A. Gopalan, C. Caramanis and S. Vishwanath. “User rankings from comparisons: Learning permutations in high dimensions”. Allerton 2011.

R. Noothigattu, N. Shah and A. Procaccia. “Choosing how to choose papers”. ArXiv 2018.

I. Stelmakh, N. Shah and A. Singh. “PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review”. ALT 2019.

Y. Xu, H. Zhao, X. Shi and N. Shah. “On strategyproof conference review”. ArXiv 2018.

Why did a Deep Neural Network (DNN) make a certain prediction? Although DNNs have been shown to be extremely accurate predictors in a range of domains, they are still largely black-box functions—even to the experts who train them—due to their complicated structure with compositions of multiple layers of nonlinearities. The most popular approach used to shed light on the predictions of DNNs is to create what is known as a saliency map, which provides a relevance score for each feature. While saliency maps may provide insights on what features are important to a DNN, it remains unclear if or how to use this information to improve a given model. One potential solution is to show not only the set of features important to a DNN for some specific prediction, but also the most relevant set of *training examples*, i.e., prototypes. As we will show, these not only help us understand the predictions of a given DNN, but also provide insights into how to *improve the performance* of the model.

In our recent paper at NeurIPS 2018, we explain the prediction of a DNN by splitting the output into a sum of contributions from each of the training instances. Before getting into the formal details, here is an illustration of how our approach works for a DNN f and a similarity function K with some precalculated sample importance.

In the figure above, we consider an image classifier trained to determine whether an image is a dog or not, and the classifier is given an image of a dog (left) and a cat (right) at test time. To understand why the model predicted the first image as a dog and the second image as not a dog, we decompose the prediction score for the dog class (0.7 and 0 respectively) into a sum of the weighted similarities between the test image and each training image. This sheds light on which training images are most important for the prediction: the blue box highlights an example with high positive influence (which we call positive prototypes), and the red box highlights one with high negative influence (negative prototypes).

The idea of decomposing a predictor into a linear combination of functions of training points is not new (for interested readers, we refer you to representer theorems when the predictor lies in certain well-behaved spaces of functions). In the following theorem, we provide an analogous decomposition for deep neural networks.

**Representer Theorem for Neural Networks**: Let us denote the neural network prediction for a test input \(x_t\) by \(\hat{y}_t = \sigma(\Phi(x_t, \Theta))\), where \(\Phi(x_t, \Theta) = \Theta_1 f_t\) and \(f_t = \Phi_2(x_t,\Theta_2)\). In simple words, \(\sigma\) is the activation function over the output logit \(\Phi\), and \(\Theta_1\) is the weight of the last layer, which takes \(f_t\) as its input. Suppose \(\Theta^*\) is a stationary point of the optimization problem: \begin{equation} \arg\min_{\Theta} \left\{\frac{1}{n}\sum_i^n L(x_i,y_i,\Theta) + g(||\Theta_1||)\right\},\end{equation} where \(g(||\Theta_1||) = \lambda ||\Theta_1||^2\) for some \(\lambda >0\). Then we have the decomposition: \begin{equation}\Phi(x_t,\Theta^*) = \sum_i^n \alpha_i k(x_t, x_i),\end{equation} where \(\alpha_{i} = \frac{1}{-2 \lambda n} \frac{\partial L(x_i,y_i,\Theta)}{\partial \Phi(x_i,\Theta)} \), \(k(x_t,x_i) = f_{i}^T f_t\), and \(\Theta^*_1 = \sum_i^n \alpha_i f_{i} \). We call each term in the summation a **representer value** for \(x_i\) given \(x_t\), and we call each \(x_i\) associated with the representer value \(\alpha_i k(x_t, x_i)\) a **representer point**. We note that \(\alpha_{i}\) measures the importance of the training instance \(x_i\) on the learned parameter, and thus we call \(\alpha_{i}\) the **global sample importance** since it is independent of the testing instance.

Our theorem indicates that the predictions of a deep neural network can be decomposed according to the figure below.
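The decomposition can be checked numerically on a deliberately tiny example. The sketch below assumes the "network" is just a last layer \(\Theta_1\) on top of fixed feature vectors \(f_i\), trained with binary cross-entropy loss (for which \(\partial L / \partial \Phi = \sigma(\Phi) - y\)) plus the \(\lambda ||\Theta_1||^2\) penalty. The feature values, labels, \(\lambda\), and all function names are made up for illustration, and this is not the code released with the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_last_layer(features, labels, lam, steps=5000, lr=0.5):
    """Gradient descent on mean binary cross-entropy + lam * ||theta||^2."""
    n, d = len(features), len(features[0])
    theta = [0.0] * d
    for _ in range(steps):
        grad = [2 * lam * t for t in theta]
        for f, y in zip(features, labels):
            residual = sigmoid(dot(theta, f)) - y  # dL/dPhi for cross-entropy
            for j in range(d):
                grad[j] += residual * f[j] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def global_importances(features, labels, theta, lam):
    """alpha_i = -(1 / (2 * lam * n)) * dL/dPhi evaluated at training point i."""
    n = len(features)
    return [-(sigmoid(dot(theta, f)) - y) / (2 * lam * n)
            for f, y in zip(features, labels)]

def representer_values(alphas, features, f_test):
    """alpha_i * k(x_t, x_i) with k given by the feature inner product."""
    return [a * dot(f, f_test) for a, f in zip(alphas, features)]

# Hypothetical penultimate-layer features and binary labels.
features = [[1.0, 0.2], [0.8, -0.5], [-1.0, 0.3], [-0.7, -0.9]]
labels = [1, 1, 0, 0]
lam = 0.1

theta = train_last_layer(features, labels, lam)
alphas = global_importances(features, labels, theta, lam)
f_test = [0.9, 0.1]
values = representer_values(alphas, features, f_test)

print(dot(theta, f_test), sum(values))  # the logit and its decomposition agree
```

Because gradient descent is run close to a stationary point, the logit of the test point matches the sum of representer values up to numerical error, and the training points with label 1 receive positive global sample importance.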

**Intuition for the Representer Theorem and examples of prototypes**: For the representer value \(\alpha_i k(x_t, x_i)\) to be positive, the global sample importance and the feature similarity must have the same sign. For a particular test image, this means that the test image and training image look similar to each other, and (likely) have the same classification label. Similarly, for this value to be negative, the global sample importance and the feature similarity must have different signs, i.e. one is negative *and* the other is positive. For a particular test image, this means that the images may look similar to each other, but they have different classification labels. Because we have decomposed the activation of the neural network into a sum of these representer values, we say that positive prototypes *excite* the network, and negative prototypes *inhibit* the network towards predicting a particular class.

As shown in the above figure, the positive representer points are all from the same class as the test point, and have a similar appearance. On the other hand, negative representer points belong to different classes despite their striking similarity in appearance.

We demonstrate the usefulness of our representer points via two use cases:

- Misclassification Analysis
- Dataset Debugging

Then we wrap up with a discussion of the computational cost associated with our approach.

*Why did the model misclassify certain instances?*

We want to use our class of explanations to understand the mistakes made by the model. With a ResNet-50 model trained on the Animals with Attributes (AwA) dataset (Xian et al. 2018), we pick test points with the ground-truth label “Antelope” and analyze why the model made mistakes on some of these test points. Among 181 test instances labeled “Antelope”, 166 were classified correctly by the model, and 15 were misclassified. Among those 15, 12 were specifically misclassified as “Deer”, just as in the image shown below.

We computed representer points for all 12 of these misclassified test instances, and identified the top *negative* representer points for the class “Antelope.” Recall from the previous section that the top negative representer points are training points that *inhibit* the network from predicting “Antelope”, which can be used to make sense of why such inhibition occurred. For all 12 instances, the four representer points shown in the above figure (bottom row) were included among the top 5 negative representer points. Notice that these negative images do contain antelopes but have dataset labels belonging to different classes, like zebra or elephant. When the model is trained on these data points, the label forces the model to focus on just the elephant or zebra and ignore the antelope coexisting in the image. The model thus learns to inhibit the “Antelope” class given an image with small antelopes and other large objects. Hence, the representer points can point back to the errors in the training data that affected the model’s test-time prediction value.

*Given a training dataset with corrupted labels, can we correct the dataset? And can we achieve better test accuracy with the corrected dataset?*

We consider a scenario where humans need to inspect the dataset quality to ensure an improvement of the model’s performance on the test data. Real-world data is bound to be noisy, and the bigger the dataset becomes, the more difficult it will be for humans to look for and fix mislabeled data points. Consequently, it is crucial to know which data points are more important than others to the model so that we can prioritize data points to inspect and facilitate the debugging process.

We run a simulated experiment where we check a fraction of the training data according to the order set by different importance scores, flip their labels, and retrain the model using the modified training data to observe the improvement of the test accuracy. We also evaluate how quickly different methods can recover and correct wrongly labeled data.

We used a logistic regression model for a binary classification task on the classes automobile vs horse from the CIFAR10 dataset. We used three methods to compute the importance values.

- Random (green line): randomly select the training point to fix.
- Influence function (blue line): select the training point with largest influence function value (Koh et al. 2017).
- Representer values (red line): select the training point with largest absolute global importance.

Our method recovers the test accuracy most quickly, and performs comparably to influence functions at finding and correcting the mislabeled data points.

*All this is great, but can you compute these explanations quickly?*

One advantage of our representer theorem is that it explicitly deconstructs a given deep neural network prediction in terms of representer values, so we were able to achieve an orders-of-magnitude speedup compared to influence functions (even with the fine-tuning step that we require, in which we search for a stationary point, and which influence functions do not need). Below, we show the time in seconds for both methods to explain one testing instance on two different datasets.

For more details on some theoretical aspects, as well as some additional experiments, please refer to the paper. We also encourage interested readers to try out our code on Github.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of CMU.

Nowadays most machine learning (ML) models predict labels from features. In classification tasks, an ML model predicts a categorical value and in regression tasks, an ML model predicts a real value. These ML models thus require a large amount of feature-label pairs. While in practice it is not hard to obtain features, it is often costly to obtain labels because this requires human labor.

Can we do more? Can we learn a model *without* too many feature-label pairs? Think of human learning: as humans, we do not need 1,000 cat images and labels “cat” to learn what is a cat or to differentiate cats from dogs. We can also learn the concept through comparisons. When we see a cat/dog, we can compare it with cats we have seen to decide whether we should label it “cat”.

Our recent papers (1,2) focus on using comparisons to build ML models. The idea of using comparisons is based on a classical psychological observation: It is easier for people to compare between items than evaluate each item alone. For example, what is the age of the man in the image?

Not very easy, right? Is he 20, 30 or 40? We can probably say he is not very old, but it is just hard to be very accurate on the exact age. Now, which person in the two images is older?

Now based on the wrinkles and silver hair, you can probably quickly judge that the second man is older.

This phenomenon is not only present for this task, but also in many other real-world applications. For example, to diagnose patients, it is usually more difficult to directly label each patient with a kind of disease by experimental tests, but easier to compare the physical conditions of two patients. In material synthesis, measuring the characteristics of a material usually requires expensive tests, but comparisons are relatively easy through simulations. For movie ratings, it is often hard for us to give scores for a specific movie, but easier to pick our favorite among a list of movies.

So how can we build ML models using comparisons? Here we describe an approach that uses comparisons to do inferences on the unlabeled samples and feed inferred labels into existing models. Below we will look at two ways for such inference, for classification and regression respectively.

As described above, our setup starts with a set of unlabeled features \(x_1, x_2,\ldots, x_n\), drawn independently and identically distributed (i.i.d.) from a feature distribution \(X\sim P_X\). Let the data dimension be \(d\). Our goal is to learn a function \(f: \mathbb{R}^d \rightarrow \mathcal{Y}\), where \(\mathcal{Y}\) is the label space. For example, for binary classification \(\mathcal{Y}=\{1, -1\}\), and for regression \(\mathcal{Y}=\mathbb{R}\).

We assume we can query either direct labels or pairwise comparisons. The direct label \(Y(x)\) is a (possibly noisy) version of \(f(x)\). The comparison \(Z\) is based on a pair of samples \(x,x'\) and indicates which one of \(x,x'\) is more likely to have the larger \(f\) value. For binary classification, this means \(Z\) indicates the more positive sample; for regression, \(Z\) indicates the larger target (e.g., the older person in a pair of images). Our goal is to use as few direct label queries as possible.

Our high-level strategy is to obtain a fully labeled sample pool \(\hat{y}_1,\ldots,\hat{y}_n\), where the \(\hat{y}_i\) are either inferred or directly labeled, to feed into a supervised learning algorithm. We will show how such inference can happen, and how the querying process can neatly combine with the learning algorithm for better performance.

Before we go to the algorithms, we first introduce our workhorse: Ranking from pairwise comparisons. We organize the comparisons to induce a ranking over all the samples. After that, we can do efficient inference with a very small amount of direct labels.

There is a vast amount of literature on ranking from pairwise comparisons, based on different assumptions on the comparison matrix and desired properties. If we have perfect and consistent comparisons, we can use QuickSort (or HeapSort, or insertion sort with binary search) to rank all \(n\) samples with \(O(n\log n)\) comparisons. If comparisons are noisy and inconsistent, things are more complicated, but we can still obtain meaningful rankings. We will not go into more detail about ranking since it is out of the scope of this post; we refer interested readers to this survey for more papers on this topic.
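As a small illustration of the noiseless case, the sketch below ranks items using only a pairwise comparison oracle. The hidden scores and the oracle are hypothetical stand-ins for a human comparing two samples. Python's built-in sort is used here in place of QuickSort or HeapSort; it also needs \(O(n\log n)\) comparisons.

```python
from functools import cmp_to_key

def rank_by_comparisons(items, compare):
    """Sort items in ascending order using only a pairwise comparison oracle."""
    return sorted(items, key=cmp_to_key(compare))

# Hypothetical hidden scores f(x); the learner never reads these directly.
hidden_f = {"a": 0.3, "b": -1.2, "c": 2.5, "d": 0.9, "e": -0.4}
query_log = []  # records every comparison that was asked

def oracle(x, x_prime):
    """Noiseless comparison: -1 if f(x) < f(x'), +1 if f(x) > f(x'), else 0."""
    query_log.append((x, x_prime))
    return (hidden_f[x] > hidden_f[x_prime]) - (hidden_f[x] < hidden_f[x_prime])

ranking = rank_by_comparisons(list(hidden_f), oracle)
```

Here `ranking` orders the items from the smallest to the largest hidden score, and `len(query_log)` is the number of comparisons that were spent.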

Now let’s suppose we have a ranking over all items. We denote it as \(x_1\prec x_2\prec \cdots\prec x_n\), where \(x_i\prec x_j\) means we think \(f(x_i)\leq f(x_j)\). Note that the actual ranking induced by \(f\) might be different from \(x_1\prec x_2\prec \cdots\prec x_n\), as we can have errors in our comparisons.

Now we consider the binary classification problem. If we have a perfect ranking with \begin{align*}f(x_1)\leq f(x_2)\leq \cdots\leq f(x_n),\end{align*} this means the first few samples have labels -1, and then the remaining samples have label +1. Given this specific structure, we would want to find the *changing point* between negative and positive samples. How are we going to find it?

Binary search! Since the samples are ranked in order, we need only \(O(\log n)\) direct label queries to locate the changing point. Note that this has a specific meaning in the context of classification: in the standard supervised learning setting, we need at least \(d\) labels to learn a classifier in \(d\) dimensions. Now, with the ranking information at hand, we only need to find a threshold in a sequence, which is equivalent to learning a classifier in one dimension. Since comparison queries are generally cheaper than label queries, our algorithm can save a lot of cost.
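The binary search can be sketched as follows, assuming a perfect ranking and noiseless labels. The function name and the toy label oracle are illustrative, not taken from the paper.

```python
def find_changing_point(ranked, query_label):
    """Return the index of the first +1 sample in a perfectly ranked list.

    Returns len(ranked) if every sample is labeled -1.
    """
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(ranked[mid]) == 1:
            hi = mid          # the first +1 is at mid or to its left
        else:
            lo = mid + 1      # the first +1 is strictly to the right
    return lo

# Toy example: 16 ranked samples, the first 11 negative and the last 5 positive.
true_labels = [-1] * 11 + [1] * 5
queried = []  # records which samples we paid to label

def query_label(i):
    queried.append(i)
    return true_labels[i]

ranked = list(range(16))
cut = find_changing_point(ranked, query_label)
inferred = [-1] * cut + [1] * (len(ranked) - cut)
```

In this toy run, all 16 labels are recovered from 4 direct label queries.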

There are a few more things to note for classification. First is about ties: Suppose our task is to differentiate between cats and dogs. If we are given two cat images, it doesn’t really matter how we rank them since we only care about the threshold between positive and negative samples.

Secondly, we can combine our algorithm with active learning to save even more label cost. Many active learning algorithms ask about a batch of samples in each round, and we can use our binary search to label each batch. In more detail, we show in our paper the following theorem:

**Theorem (Informal).** Suppose each label is correct with probability \(1/2+c\), for a constant \(c\). Then an active learning algorithm would require \(\Omega(d\log(1/\varepsilon))\) direct labels to achieve an error rate of \(\varepsilon\). On the other hand, using binary search on the ranking requires only \(O(\log(1/\varepsilon))\) direct labels, together with \(O(d\log(d/\varepsilon))\) comparisons.

If we are doing regression, we cannot hope to find a threshold in the ranking, since we need to predict a real number for each label. However, ranking can still help regression through isotonic regression. Given a ranked sequence \begin{align*} f(x_1)\leq f(x_2)\leq\cdots \leq f(x_n) \text{ and } y_i=f(x_i)+\varepsilon_i, \varepsilon_i\sim \mathcal{N}(0,1),\end{align*} isotonic regression aims to find the solution of

\begin{align*}
\min_{\hat{y}_i} & \sum_{i=1}^n (\hat{y}_i-y_i)^2\\
\text{s.t. } & \hat{y}_i\leq \hat{y}_{i+1}, \forall i=1,2,\ldots,n-1.
\end{align*}

If we use \(y_i\) directly as our labels, the mean-squared error \(\frac{1}{n}\sum_{i=1}^n (y_i-f(x_i))^2\) will have an expectation of 1, since \(\varepsilon_i\sim \mathcal{N}(0,1)\). Isotonic regression instead enjoys an \(n^{-2/3}\) statistical rate for \(n\) labeled points, which vanishes as \(n\rightarrow \infty\). For a reference, see (Zhang, 2002).
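The isotonic regression problem above can be solved exactly by the pool-adjacent-violators algorithm (PAVA). The following is a minimal sketch of that standard algorithm, written for illustration rather than taken from our code; the input is assumed to be already ordered by the ranking.

```python
def isotonic_regression(y):
    """Least-squares fit to y under a non-decreasing constraint (PAVA)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # every point gets its block mean
    return fitted

noisy = [1.0, 3.0, 2.0, 4.0, 3.5, 5.0]
fit = isotonic_regression(noisy)
# fit == [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

Each pair of order violations (3.0 before 2.0, and 4.0 before 3.5) is replaced by its average, which is what drives the noise reduction.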

With \(m\) labeled points, the \(m^{-2/3}\) rate of isotonic regression decays faster than the optimal rates of many non-parametric regression problems because it is dimension-independent. Non-parametric methods typically have an error rate of \(m^{-\frac{2}{d+2}}\) given \(m\) labels, the so-called curse of dimensionality (see Tsybakov’s book for an introduction to non-parametric regression). Since the rate of isotonic regression decays much faster, we only need a fraction of the labels for good accuracy. We leverage this property to design the following algorithm: suppose we only directly query \(m\) labels. Given a ranking over all \(n\) points, we can infer the labels of the unlabeled samples from their nearest labeled points. That is, we query \(y_{t_1},\ldots,y_{t_m}\), obtain refined values \(\hat{y}_{t_1},\ldots,\hat{y}_{t_m}\) using the above isotonic regression formulation, and then label each point as \(\hat{y}_i=\hat{y}_{t_j}\), where \(i \in \{1,\ldots,n\}\) and \(t_j\) is \(i\)’s nearest neighbor in \(\{t_1,\ldots,t_m\}\).
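The inference step can be sketched as below. This assumes that "nearest" is measured in rank position, and that the refined values \(\hat{y}_{t_j}\) have already been computed by isotonic regression. The function name and the toy numbers are illustrative.

```python
import bisect

def infer_from_nearest_labeled(n, labeled_positions, refined_values):
    """Give each of n ranked points the refined value of its nearest labeled
    point, with distance measured in rank position.

    labeled_positions must be sorted in increasing order; ties go to the
    lower-ranked labeled point.
    """
    inferred = []
    for i in range(n):
        j = bisect.bisect_left(labeled_positions, i)
        if j == 0:
            nearest = 0                       # before the first labeled point
        elif j == len(labeled_positions):
            nearest = j - 1                   # after the last labeled point
        else:
            left_gap = i - labeled_positions[j - 1]
            right_gap = labeled_positions[j] - i
            nearest = j - 1 if left_gap <= right_gap else j
        inferred.append(refined_values[nearest])
    return inferred

# 8 ranked points; points at ranks 1, 4 and 6 were labeled directly.
y_hat = infer_from_nearest_labeled(8, [1, 4, 6], [20.0, 35.0, 50.0])
# y_hat == [20.0, 20.0, 20.0, 35.0, 35.0, 35.0, 50.0, 50.0]
```

The resulting fully labeled pool can then be handed to any supervised regression method.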

In our paper, we analyze this algorithm under the non-parametric regression setting. We have the following theorem:

**Theorem (Informal).** Suppose the underlying function \(f\) is Lipschitz. If we use \(m\) direct labels, any algorithm will incur an error of at least \(\Omega\left(m^{-\frac{2}{d+2}}\right)\). If we use isotonic regression with nearest neighbors, the error will be \(m^{-\frac{2}{3}}+n^{-\frac{2}{d}}\), where \(m\) is the number of direct labels, and \(n\) is the number of ranked points. This rate is optimal for any algorithm using \(m\) direct labels and \(n\) ranked points.

Note that the MSE of non-parametric regression using only the labeled samples is \(\Theta(m^{-\frac{2}{d+2}})\), so the number of labels needed to reach a given error grows exponentially in \(d\), which makes non-parametric regression impractical in high dimensions. Focusing on the dependence on \(m\), our result improves the rate to \(m^{-2/3}\), whose exponent no longer depends on \(d\). Therefore, using the ranking information, we can avoid the curse of dimensionality.
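A quick back-of-the-envelope calculation shows how large the gap is. The sketch below ignores all constants and the \(n^{-2/d}\) term, and simply solves \(m^{-r} = \varepsilon\) for \(m\) under each rate exponent \(r\); the target error and dimension are arbitrary choices for illustration.

```python
def labels_for_error(target_error, rate_exponent):
    """Solve m ** (-rate_exponent) = target_error for m (constants ignored)."""
    return target_error ** (-1.0 / rate_exponent)

d = 10
m_nonparametric = labels_for_error(0.1, 2.0 / (d + 2))   # roughly 1e6 labels
m_isotonic = labels_for_error(0.1, 2.0 / 3.0)            # roughly 32 labels
```

For \(d = 10\) and a target error of 0.1, the label-only rate asks for about a million labels, while the \(m^{-2/3}\) rate asks for a few dozen.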

Now let’s test our algorithm in practice. Our task is to predict the ages of people in images, as aforementioned. We use the APPA-REAL dataset, with 7,113 images and associated ages. The dataset is suitable for comparisons because it contains both the biological age, as well as the apparent age estimated from human labelers. Suppose our goal is to predict the biological age, and we can simulate comparisons by comparing the apparent ages.

Our classification task is to judge whether a person is under or over 30 years old. We compare our method with a baseline active learning method that only uses label queries. Both methods use a linear SVM classifier (features are extracted from the 128-dimensional top layer of FaceNet, an unsupervised method to extract features from faces). The shaded regions represent the standard deviation over 20 repeated experiments. The plots show that comparisons indeed reduce the number of label queries.

Our regression task is to predict the actual ages, and we compute the mean squared error (MSE) to evaluate different methods. Our label-only baselines are nearest neighbors (NN) methods with 5 or 10 neighbors (5-NN and 10-NN), and support vector regression (SVR). Our methods use 5-NN or 10-NN after we have inferred the labels via isotonic regression. We thus name our methods R\(^2\) 5-NN and R\(^2\) 10-NN. Again, the experiment shows that comparisons can reduce the number of label queries.

Of course, binary classification and regression are not the only settings where using comparison information can have a big impact. Using the rank-and-infer approach, we hope to extend these results to multi-class classification, optimization, and reinforcement learning. Feel free to get in touch if you want to learn more!

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.

Figure 1: A motivating figure for this blog post. The plot on the left shows that increasing the amount of regularization used when running ridge regression (corresponding to moving up higher on the y-axis) implies that ridge begins to tune out the directions in the data that have lower variation (i.e., we see more dark green as we scan from right to left). Meanwhile, the plot on the right actually shows very similar behavior, but this time for a very different estimator: gradient descent when run on the least-squares loss, as we terminate it earlier and earlier (i.e., as we increasingly stop gradient descent far short of when it converges, given again by moving higher up on the y-axis). This raises the question: are these similarities just a coincidence? Or is there something deeper going on here, linking together early-stopped gradient descent to ridge regression? We’ll try to address this question, in this blog post. (By the way, many of the other details in the plot, e.g., the labels “Grad Flow”, “lambda”, and “1/t” will all be explained, later on in the post.)

These days, it seems as though tools from statistics and machine learning are being used almost everywhere, with methods from the literature now seen on the critical path in a number of different fields and industries. A consequence of this sort of proliferation is that, increasingly, both non-specialists and specialists alike are being asked to deploy statistical models into the wild. This often makes “simple” methods a natural choice, with “simple” usually meaning (a) “easy-to-implement” and/or (b) “computationally cheap”.

That being the case, what could be simpler than gradient descent? Gradient descent is often quite easy to implement and computationally affordable, which probably explains (at least some of) its popularity. One thing people tend to do with gradient descent is to “stop it early”, by which I mean: people tend to *not* run gradient descent until convergence. Why not? Well, it has been long observed (by many, e.g., here, here, and here) that early-stopping gradient descent has a kind of regularizing effect—even if the loss function that gradient descent is run on has no *explicit* regularizer. In fact, it has been suggested (see, e.g., here and here) that the *implicit regularization* properties of optimization algorithms may explain at least in part some of the recent successes of deep neural networks in practice, making implicit regularization a very lively and growing area of research right now.

In our recent paper, we precisely characterize the implicit regularization effect of early-stopping gradient descent, when run on the least-squares regression problem. Apart from being interesting in its own right, the hope is that this sort of characterization will also be useful to practitioners.

(By the way, in case you think that least-squares is an elementary and boring problem: I would counter that and say that least-squares turns out to actually be quite an interesting problem to study, and moreover it represents a good starting point before moving onto more exotic problems. In any event, I’ll mention ideas for future work at the end of this post!)

To fix ideas, here is the standard least-squares regression problem:

\begin{equation}
\tag{1}
\hat \beta^{\textrm{ls}} \in \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; \frac{1}{2n} \|y-X\beta\|_2^2.
\end{equation}

To be clear, \(y \in \mathbb{R}^n\) is the response, \(X \in \mathbb{R}^{n \times p}\) is the data matrix, \(n\) denotes the number of samples, and \(p\) the number of features. Running gradient descent on the problem (1) just amounts to

\begin{equation}
\tag{2}
\beta^{(k)} = \beta^{(k-1)} + \epsilon \cdot \frac{X^T}{n} (y - X \beta^{(k-1)}),
\end{equation}

where \(k=1,2,3,\ldots\) is an iteration counter, and \(\epsilon > 0\) is a fixed step size. At this point, it helps to also consider ridge regression:

\begin{equation}
\tag{3}
\hat \beta^{\textrm{ridge}}(\lambda) = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; \frac{1}{n} \|y-X\beta\|_2^2 + \lambda \|\beta\|_2^2,
\end{equation}

where \(\lambda \geq 0\) is a tuning parameter.
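To make (2) and (3) concrete, here is a minimal NumPy sketch on synthetic data. The function names, the data, and the choice of step size are assumptions for illustration; the ridge solution uses the closed form implied by the first-order condition of (3).

```python
import numpy as np

def gradient_descent(X, y, step, num_iters):
    """Iteration (2), started from beta = 0."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(num_iters):
        beta = beta + step * (X.T @ (y - X @ beta)) / n
    return beta

def ridge(X, y, lam):
    """Closed-form minimizer of (3): (X^T X / n + lam * I)^{-1} X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# A step size of 1 / (largest eigenvalue of X^T X / n) guarantees convergence.
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()

beta_gd_converged = gradient_descent(X, y, step, num_iters=2000)
beta_ridge_zero = ridge(X, y, lam=0.0)          # plain least squares
beta_gd_no_steps = gradient_descent(X, y, step, num_iters=0)
beta_ridge_huge = ridge(X, y, lam=1e12)         # essentially the null model
```

The two ends of each path line up: many gradient steps match ridge with \(\lambda = 0\), and zero steps match ridge with a very large \(\lambda\).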

It might already be intuitively obvious to you that (a) running (2) until convergence is equivalent to running (3) with \(\lambda = 0\); and (b) assuming the initialization \(\beta^{(0)} = 0\), not taking any steps in (2) just yields the null model, equivalent to running (3) with \(\lambda \to \infty\). Put differently, it seems as though running gradient descent for *longer* corresponds to using *less* ridge regularization, i.e., a smaller \(\lambda\).

Here, we are plotting the *estimation risk* (defined next) of ridge regression and (essentially) gradient descent on the y-axis, vs. the ridge regularization strength \(\lambda = 1/t\) (or, equivalently, the inverse of the number of gradient descent iterations) on the x-axis; the data was generated by drawing samples from a normal distribution, but similar results hold for other distributions as well. I’ll explain the reason for the “essentially” in just a minute.

First of all, by the *estimation risk* of an estimator \(\hat \beta\), I simply mean

\begin{equation}
\mathrm{Risk}(\hat \beta;\beta_0) := \mathbb{E} \| \hat \beta - \beta_0 \|_2^2,
\end{equation}

where here we are taking an expectation over the randomness in the response \(y\), which is assumed to follow a (parametric) distribution with mean \(X \beta_0\) and covariance \(\sigma^2 I\), where \(\sigma^2 > 0\). For now, we fix both the underlying coefficients \(\beta_0 \in \mathbb{R}^p\) as well as the data \(X\); we will allow both to be random, later on.

Back to the plot: in solid black and red, we plot the estimation risk of ridge regression and gradient descent, respectively, while we plot their limiting (i.e., large \(n,p\)) estimation risks in dashed black and red. What should be clear for now is that, aside from agreeing at their endpoints, the black and red curves are *very* close everywhere else, as well. In fact, it appears as though the risk of gradient descent cannot be more than, say, 1.2 times that of ridge regression.

Actually, maybe a second picture can also be of some help here. Below, we plot the same risks as we did above, but now the x-axis is an estimator’s achieved model complexity, as measured by its \(\ell_2\)-norm; the point of this second plot is that the risk curves are now seen to be **virtually identical**. So, hopefully that reinforces the point from earlier. Unfortunately, studying this sort of setup turns out to be a little complicated, so we’ll stick to the setup corresponding to the first plot, for the rest of this post.

The reason for the “essentially” parenthetical above was that in our paper we actually study gradient descent *with infinitesimally small step sizes*, which is often called *gradient flow* (this is why “Grad flow” appears in the legend of the plot, above). In contrast to the gradient descent iteration (2), gradient flow for (1) can be expressed as

\begin{equation}
\tag{4}
\hat \beta^{\textrm{gf}}(t) = (X^T X)^+ (I - \exp(-t X^T X/n)) X^T y,
\end{equation}

where \(A^+\) denotes the Moore-Penrose pseudo-inverse, \(\exp(A)\) denotes the matrix exponential, and \(t\) denotes “time”. This turns out to be the key to the analysis (and standard tools from numerical analysis can connect results obtained for (4) to those for (2)). Here is a simplified version of one result from our paper (i.e., Theorem 1) that follows rather immediately after taking the continuous-time viewpoint.

**Theorem (simplified).** Under the conditions given above, for any underlying coefficients \(\beta_0 \in \mathbb{R}^p\) and times \(t \geq 0\), we have that \(\mathrm{Risk}(\hat \beta^{\textrm{gf}}(t);\beta_0) \leq 1.6862 \cdot \mathrm{Risk}(\hat \beta^{\textrm{ridge}}(1/t);\beta_0)\).

In words, the estimation risk of gradient flow is no more than 1.6862 times that of ridge, at any point along their paths, provided we “line them up” by taking the ridge regularization strength as \(\lambda = 1/t\) (which seems fairly natural, as discussed above). The constant of 1.6862 also seems to match up pretty well with the constant of 1.2 that we noticed empirically above.

To be completely clear, we are certainly not the first authors to relate gradient descent and \(\ell_2\)-regularization (see our paper for a mention of related work), but we are not aware of any other kind of result that does so with this level of specificity as well as generality; more on this latter aspect after a proof of the result.
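Before the proof, it may help to see the closed form (4) in action. The sketch below evaluates (4) through the eigendecomposition of \((1/n) X^T X\) and compares it with iteration (2) run with a small step size, so that \(k\) steps of size \(\epsilon\) correspond to time \(t = k\epsilon\). The data and function names are assumptions for illustration.

```python
import numpy as np

def gradient_flow(X, y, t):
    """Evaluate (4) via the eigendecomposition X^T X / n = V diag(s) V^T."""
    n = X.shape[0]
    s, V = np.linalg.eigh(X.T @ X / n)
    positive = s > 1e-12
    safe_s = np.where(positive, s, 1.0)
    # (1 - exp(-t s)) / s per direction; zero-eigenvalue directions get 0,
    # which is what the pseudo-inverse in (4) prescribes.
    gain = np.where(positive, -np.expm1(-t * s) / safe_s, 0.0)
    return V @ (gain * (V.T @ (X.T @ y) / n))

def gradient_descent(X, y, step, num_iters):
    """Iteration (2), started from beta = 0."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(num_iters):
        beta = beta + step * (X.T @ (y - X @ beta)) / n
    return beta

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

t = 2.0
beta_flow = gradient_flow(X, y, t)
beta_gd = gradient_descent(X, y, step=1e-3, num_iters=2000)   # 2000 * 1e-3 = t
beta_flow_late = gradient_flow(X, y, 1e6)                      # long-time limit
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

With a small step size, the discrete iterates track the continuous path closely, and for very large \(t\) the gradient flow estimate approaches the least-squares solution.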

**Proof (sketch).** To show the result, we need two simple facts: first, for \(x \geq 0\), it holds that (a) \(\exp(-x) \leq 1/(1+x)\); and second (b) \(1-\exp(-x) \leq 1.2985 \cdot x/(1+x)\). We also need a little bit of notation: let \(s_i, v_i, \; i=1,\ldots,p\) denote the eigenvalues and eigenvectors, respectively, of \((1/n) X^T X\). Owing to the continuous-time representation, it turns out the estimation risk of gradient flow can be written as \(\mathrm{Risk}(\hat \beta^{\textrm{gf}}(t);\beta_0) = \sum_{i=1}^p a_i\), where

\begin{equation*}
a_i = |v_i^T \beta_0|^2 \exp(-2 t s_i) + \frac{\sigma^2}{n} \frac{(1 - \exp(-t s_i))^2}{s_i},
\end{equation*}

while that of ridge regression can be written as \(\mathrm{Risk}(\hat \beta^{\textrm{ridge}}(\lambda);\beta_0) = \sum_{i=1}^p b_i\), where

\begin{equation*}
b_i = |v_i^T \beta_0|^2 \frac{\lambda^2}{(s_i + \lambda)^2} + \frac{\sigma^2}{n} \frac{s_i}{(s_i + \lambda)^2}.
\end{equation*}

Now, using facts (a) and (b) (as well as a change of variables), we get that

\begin{align*}
a_i &\leq |v_i^T \beta_0|^2 \frac{1}{(1 + t s_i)^2} + \frac{\sigma^2}{n} 1.2985^2 \frac{t^2 s_i}{(1 + t s_i)^2} \\
&\leq 1.6862 \bigg(|v_i^T \beta_0|^2 \frac{(1/t)^2}{(1/t + s_i)^2} + \frac{\sigma^2}{n} \frac{s_i}{(1/t + s_i)^2} \bigg) \\
&= 1.6862 \, b_i,
\end{align*}

where, again, we used \(\lambda = 1/t\). Summing over \(i=1,\ldots,p\) in the above gives the result.
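The bound is easy to check numerically. The sketch below evaluates the risk expressions \(\sum_i a_i\) and \(\sum_i b_i\) on synthetic Gaussian data over a grid of times, and also checks facts (a) and (b) on a grid of \(x\) values. The problem sizes and grids are arbitrary choices, and a grid check is of course an illustration rather than a proof.

```python
import numpy as np

def risk_gradient_flow(t, s, proj_sq, sigma2, n):
    """Sum of the a_i: squared bias plus variance of gradient flow at time t."""
    bias = proj_sq * np.exp(-2.0 * t * s)
    variance = (sigma2 / n) * (1.0 - np.exp(-t * s)) ** 2 / s
    return float(np.sum(bias + variance))

def risk_ridge(lam, s, proj_sq, sigma2, n):
    """Sum of the b_i for ridge with regularization strength lam."""
    bias = proj_sq * lam**2 / (s + lam) ** 2
    variance = (sigma2 / n) * s / (s + lam) ** 2
    return float(np.sum(bias + variance))

rng = np.random.default_rng(0)
n, p, sigma2 = 200, 20, 1.0
X = rng.normal(size=(n, p))
beta0 = rng.normal(size=p)
s, V = np.linalg.eigh(X.T @ X / n)   # eigenvalues s_i and eigenvectors v_i
proj_sq = (V.T @ beta0) ** 2         # |v_i^T beta_0|^2

times = np.logspace(-3, 3, 200)
ratios = np.array([
    risk_gradient_flow(t, s, proj_sq, sigma2, n)
    / risk_ridge(1.0 / t, s, proj_sq, sigma2, n)
    for t in times
])
worst_ratio = float(ratios.max())

# Facts (a) and (b) on a grid of x > 0.
x = np.linspace(1e-6, 50.0, 200001)
fact_a_holds = bool(np.all(np.exp(-x) <= 1.0 / (1.0 + x)))
fact_b_constant = float(np.max(-np.expm1(-x) * (1.0 + x) / x))
```

Here `worst_ratio` stays below 1.6862 as the theorem requires, and `fact_b_constant` is the largest observed value of \((1-\exp(-x))(1+x)/x\), which sits just under 1.2985.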

**Going (much) further**

Earlier, I hinted that it was possible to generalize the result presented above. In fact, in our paper, we show how to obtain the same result for various other notions of risk:

- Bayes estimation risk (i.e., where \(\beta_0\) is assumed to follow some prior)
- In-sample (Bayes) prediction risk (i.e., \( (1/n) \mathbb{E} \| X \hat \beta - X \beta_0 \|_2^2 \))
- Out-of-sample Bayes prediction risk (i.e., \( \mathbb{E}[(x_0^T \hat \beta - x_0^T \beta_0)^2] \), where \(x_0\) is a test point).

See (the rest of) Theorem 1 as well as Theorem 2 in the paper, for details. Finally, it is also possible to obtain even tighter results in all these settings when we examine the risks of optimally-tuned ridge and gradient flow; see Theorem 3.

To finish up, I’ll (*very*) briefly mention some of the insights that we get by examining the above issues from an asymptotic viewpoint. In a really nice piece of recent work, it was shown that under a high-dimensional asymptotic setup with \(p,n \to \infty\) such that \(p/n \to \gamma \in (0,\infty)\), the out-of-sample Bayes prediction risk of ridge regression converges to the expression

\begin{equation}
\tag{5}
\sigma^2 \gamma \big[ \theta(\lambda) + \lambda (1 - \alpha_0 \lambda) \theta'(\lambda) \big],
\end{equation}

almost surely, for each \(\lambda > 0\), where \(\alpha_0\) is a fixed constant depending on the variance of the prior on the coefficients. Also, \(\theta(\lambda)\) is a functional (whose exact form is not important for this brief description) that depends on \(\sigma^2, \lambda, \gamma\), as well as the limiting distribution of the eigenvalues of the sample covariance matrix.

Where I think things get really interesting is that it turns out for gradient flow, the analogous risk has the following almost sure limit (see Theorem 6 in our paper for details):

\begin{equation}
\tag{6}
\sigma^2 \gamma \bigg[ \alpha_0 \mathcal L^{-1}(\theta)(2t) + 2 \int_0^t \big( \mathcal L^{-1}(\theta)(u) - \mathcal L^{-1}(\theta)(2u) \big) \, du\bigg],
\end{equation}

where \(\mathcal L^{-1}(\theta)\) denotes the inverse Laplace transform of \(\theta\). In a nice bit of symmetry, note that (5) features the functional \(\theta\) and its derivative, whereas (6) features the inverse Laplace transform of the functional, \(\mathcal L^{-1}(\theta)\), and its antiderivative. Moreover, note that (6) is an asymptotically *exact* risk expression, i.e., there are no hidden constants or anything like that. The key to the analysis here turns out to be applying recent tools from random matrix theory to (functionals of) the sample covariance matrix \((1/n) X^T X\). Again, that was a very brief sketch of the result, but hopefully it at least got across some (more) of the interesting duality between ridge regression and gradient descent.

If you found this post interesting, that’s great! There are a number of other connections between ridge regression and gradient descent that we point out in the paper, which I didn’t have the space to get into here. There are also plenty of places to go next, including studying stochastic gradient descent and moving beyond least-squares (both of which we are doing right now). Feel free to get in touch if you would like to learn more.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.

Figure 1: Overview of the contextual parameter generator that is introduced in this post. The top part of the figure shows a typical neural machine translation system (consisting of an encoder and a decoder network). The bottom part, shown in red, shows our parameter generator component.

Machine translation is the problem of translating sentences from some source language to a target language. Neural machine translation (NMT) directly models the mapping of a source language to a target language without any need for training or tuning any component of the system separately. This has led to rapid progress in NMT and its successful adoption in many large-scale settings.

NMT systems typically consist of two main components: the *encoder* and the *decoder*. The encoder takes as input the source language sentence and generates some latent representation for it (e.g., a fixed-size vector). The decoder then takes that latent representation as input and generates a sentence in the target language. The sentence generation is often done in an autoregressive manner (i.e., words are generated one-by-one in a left-to-right manner).

It is easy to see how this kind of architecture can be used to translate from a single source language to a single target language. However, translating between arbitrary pairs of languages is not as simple. This problem is referred to as *multilingual machine translation*, and it is the problem we tackle. Currently, there exist three approaches for multilingual NMT.

Assuming that we have \(L\) languages and \(P\) trainable parameters per NMT model (these could be, for example, the weights and biases of a recurrent neural network) that translates from a single source language to a single target language, we have:

- **Pairwise:** Use a pairwise NMT model per language pair. This results in \(L^2\) separate models, each with \(P\) parameters. The main issue with this approach is that no information is shared between different languages, and even between models translating from English to other languages, for example. This is especially problematic for low-resource languages, where we have very little training data available, and would ideally want to leverage the high availability of data for other languages, such as English.
- **Universal:** Use a single NMT model for all language pairs. In this case, an artificial word can be added to the source sentence denoting the target language for the translation. This results in a single model with \(P\) parameters, since the whole model is shared across all languages. This approach can result in overfitting to the high-resource languages. It can also limit the expressivity of the translation model for languages that are more “different” (e.g., for Turkish, if training using Italian, Spanish, Romanian, and Turkish). Ideally, we want something in-between pairwise and universal models.
- **Per-Language Encoder/Decoder:** Use a separate encoder and a separate decoder for each language. This results in a total of \(LP\) parameters and lies somewhere between the pairwise and the universal approaches. However, no information can be shared between languages that are similar (e.g., Italian and Spanish).

At EMNLP 2018, we introduced the *contextual parameter generator (CPG)*, a new way to share information across different languages that generalizes the above three approaches, while mitigating their issues and allowing explicit control over the amount of sharing.

Let us denote the source language for a given sentence pair by \(\ell_s\) and the target language by \(\ell_t\). When using the contextual parameter generator, the parameters of the encoder are defined as \(\theta^{(enc)}\triangleq g^{(enc)}({\bf l}_s)\), for some function \(g^{(enc)}\), where \({\bf l}_s\) denotes a language embedding for the source language \(\ell_s\). Similarly, the parameters of the decoder are defined as \(\theta^{(dec)}\triangleq g^{(dec)}({\bf l}_t)\) for some function \(g^{(dec)}\), where \({\bf l}_t\) denotes a language embedding for the target language \(\ell_t\). Our general formulation does not impose any constraints on the functional form of \(g^{(enc)}\) and \(g^{(dec)}\). In this case, you can think of the source language, \(\ell_s\), as a context for the encoder. The parameters of the encoder depend on its context, but its architecture is common across all contexts. We can make a similar argument for the decoder, and that is where the name of this parameter generator comes from. We can even go a step further and have a parameter generator that defines \(\theta^{(enc)}\triangleq g^{(enc)}({\bf l}_s, {\bf l}_t)\) and \(\theta^{(dec)}\triangleq g^{(dec)}({\bf l}_s, {\bf l}_t)\), thus coupling the encoding and decoding stages for a given language pair. This would make the model more expressive and could thus work better in cases where large amounts of training data are available. In our experiments we stick to the previous, *decoupled*, form, because it has the potential to lead to a common representation among languages, also known as an *interlingua*. An overview diagram of how the contextual parameter generator fits in the architecture of NMT models is shown in Figure 1 in the beginning of this post.

Concretely, because the encoding and decoding stages are decoupled, the encoder is not aware of the target language while it produces the intermediate representation of a sentence. We can therefore take that encoded representation and translate it to any target language, since it does not depend on any particular target language. This makes for a stronger argument that the intermediate representation produced by our encoder could be approaching a universal interlingua, more so than methods that are aware of the target language when they perform encoding.

We refer to the functions \(g^{(enc)}\) and \(g^{(dec)}\) as *parameter generator networks*. A simple form that works, and about which we can reason, is to define the parameter generator networks as simple linear transforms:

$$g^{(enc)}({\bf l}_s) \triangleq {\bf W^{(enc)}} {\bf l}_s,$$

$$g^{(dec)}({\bf l}_t) \triangleq {\bf W^{(dec)}} {\bf l}_t,$$

where \({\bf l}_s, {\bf l}_t \in \mathbb{R}^M\), \({\bf W^{(enc)}} \in \mathbb{R}^{P^{(enc)} \times M}\), \({\bf W^{(dec)}} \in \mathbb{R}^{P^{(dec)} \times M}\), \(M\) is the language embedding size, \(P^{(enc)}\) is the number of parameters of the encoder, and \(P^{(dec)}\) is the number of parameters of the decoder.
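As a minimal sketch of the linear generator above, the following pure-Python snippet computes \(\theta^{(enc)} = {\bf W^{(enc)}} {\bf l}_s\) for two languages. The sizes, the random initialization, and the names `generate_parameters`, `W_enc`, and `language_embeddings` are assumptions made for illustration; they are not taken from our released implementation.

```python
import random


def generate_parameters(W, embedding):
    """Compute W @ embedding, where W is a P x M matrix given as a list of rows."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in W]


# Toy sizes chosen only for illustration: M = 3 embedding dimensions and
# P_ENC = 5 encoder parameters (real encoders have millions of parameters).
rng = random.Random(0)
M, P_ENC = 3, 5

# The generator matrix is shared by all languages.
W_enc = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(P_ENC)]

# Each language only owns a small embedding vector of size M.
language_embeddings = {
    "en": [rng.gauss(0.0, 1.0) for _ in range(M)],
    "de": [rng.gauss(0.0, 1.0) for _ in range(M)],
}

# The encoder parameters for each source language come from the same matrix.
theta_enc_en = generate_parameters(W_enc, language_embeddings["en"])
theta_enc_de = generate_parameters(W_enc, language_embeddings["de"])
```

In a real model, the flat vector returned by the generator would be reshaped into the weight matrices and biases of the encoder, and both the matrix and the embeddings would be trained end to end.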

Another interpretation of this model is that it imposes a low-rank constraint on the parameters. As opposed to our approach, in the base case of using multiple pairwise models to perform multilingual translation, each model has \(P = P^{(enc)} + P^{(dec)}\) learnable parameters for its encoder and decoder. Given that the models are pairwise, for \(L\) languages, we have a total of \(L(L - 1)\) learnable parameter vectors of size \(P\). On the other hand, using our contextual parameter generator we have a total of \(L\) vectors of size \(M\) (one for each language), and a single matrix of size \(P \times M\). Then, the parameters of the encoder and the decoder, for a single language pair, are defined as a linear combination of the \(M\) columns of that matrix, as shown in the above equations. In our EMNLP 2018 paper, we consider and discuss more options for the parameter generator network that allow for controllable parameter sharing.
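One way to make the low-rank structure explicit is to stack the generated encoder parameters of all languages as columns of a single matrix. The symbols \(\theta^{(enc)}_i\) (encoder parameters generated for language \(i\)) and \(\Theta^{(enc)}\) (the stacked matrix) are notation introduced here only for illustration:

$$\Theta^{(enc)} \triangleq \left[\theta^{(enc)}_1, \dots, \theta^{(enc)}_L\right] = {\bf W^{(enc)}} \left[{\bf l}_1, \dots, {\bf l}_L\right], \qquad \mathrm{rank}\left(\Theta^{(enc)}\right) \leq \min(M, L).$$

Under the linear generator, every per-language parameter vector therefore lies in the column space of \({\bf W^{(enc)}}\), which has dimension at most \(M\). The same argument applies to the decoder.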

The contextual parameter generator is a generalization of previous approaches for multilingual NMT:

- **Pairwise:** \(g\) picks a different parameter set based on the language pair.
- **Universal:** \(g\) picks the same parameters for all languages.
- **Per-Language Encoder/Decoder:** \(g\) picks a different set of encoder/decoder parameters based on the languages.
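Two of these special cases are easy to see with a linear generator. The sketch below is one possible construction, assumed here for illustration rather than quoted from the paper: with one-hot language embeddings (so \(M = L\)), the generator simply selects one column of the matrix per language, and with a single embedding shared by all languages it returns the same parameters for everyone. All names and numbers are made up.

```python
def generate_parameters(W, embedding):
    """Compute W @ embedding, where W is a P x M matrix given as a list of rows."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in W]


def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical toy setup: L = 3 languages, P = 4 encoder parameters, M = L.
LANGUAGES = ["en", "de", "tr"]
W_enc = [
    [0.1, 0.5, 0.9],
    [0.2, 0.6, 1.0],
    [0.3, 0.7, 1.1],
    [0.4, 0.8, 1.2],
]

# Per-language encoder: column i of W_enc becomes the encoder of language i,
# so nothing is shared between languages.
per_language = {
    lang: generate_parameters(W_enc, one_hot(i, len(LANGUAGES)))
    for i, lang in enumerate(LANGUAGES)
}

# Universal: every language uses the same embedding, so every language gets
# exactly the same encoder parameters.
shared_embedding = one_hot(0, len(LANGUAGES))
universal = {
    lang: generate_parameters(W_enc, shared_embedding) for lang in LANGUAGES
}
```

Choosing \(M\) smaller than \(L\) with learned, dense embeddings sits between these two extremes.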

The parameter generator also enables *semi-supervised learning*. Monolingual data can be used to train the shared encoder/decoder networks to translate a sentence from some language to itself (similar to the idea of auto-encoders). This is possible and can help learning because many of the learnable parameters are shared across languages.

Furthermore, *zero-shot translation*, where the model translates between language pairs for which it has seen no explicit training data, is also possible. This is because the same per-language parameters are used to translate to and from a given language, irrespective of the language at the other end. Therefore, as long as we train our model using some language pairs that involve a given language, it is possible to learn to translate in any direction involving that language.

Let us assume that we have trained a model using data for some set of languages, \(\ell_1, \ell_2, \dots, \ell_m\). If we obtain data for some new language \(\ell_n\), we do not have to retrain the whole model from scratch. In fact, we can fix the parameters that are shared across all languages and only learn the embedding for the new language (along with the relevant word embeddings if not using a shared vocabulary). Assuming that we had a sufficient number of languages in the beginning, this may allow us to obtain reasonable translation performance for the new language, with a minimal amount of training (due to the small number of parameters that need to be learned in this case — to put this into perspective, in most of our experiments we used language embeddings of size 8).

For the base case of using multiple pairwise models to perform multilingual translation, each model has \(P + 2WV\) parameters, where \(P = P^{(enc)} + P^{(dec)}\), \(W\) is the word embedding size, and \(V\) is the vocabulary size per language (assumed to be the same across languages, without loss of generality). Given that the models are pairwise, for \(L\) languages, we have a total of \(L(L - 1)(P + 2WV) = \mathcal{O}(L^2P + 2L^2WV)\) learnable parameters. For our approach, using the linear parameter generator network we have a total of \(\mathcal{O}(PM + LWV)\) learnable parameters. Note that the number of encoder/decoder parameters has no dependence on \(L\) now, meaning that our model can easily scale to a large number of languages.

To put these numbers into perspective, in our experiments we had \(W = 512\), \(V = 20000\), \(L = 8\), \(M = 8\), and \(P\) on the order of millions.
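To get a feel for the difference, the following sketch plugs numbers into the two expressions above. The value `P = 10_000_000` is a hypothetical stand-in for "on the order of millions", `M = 8` is the language embedding size quoted for most of our experiments, and the CPG count keeps only the leading terms \(PM + LWV\), so the results are rough estimates rather than exact model sizes.

```python
def pairwise_parameters(L, P, W, V):
    """Total parameters for L(L - 1) pairwise models with P + 2WV parameters each."""
    return L * (L - 1) * (P + 2 * W * V)


def cpg_parameters(L, P, W, V, M):
    """Leading terms for CPG: a P x M generator matrix plus per-language word embeddings."""
    return P * M + L * W * V


W, V, L, M = 512, 20_000, 8, 8
P = 10_000_000  # hypothetical value, assumed only for this illustration

pairwise_total = pairwise_parameters(L, P, W, V)
cpg_total = cpg_parameters(L, P, W, V, M)

print(f"pairwise: {pairwise_total:,} parameters")
print(f"CPG:      {cpg_total:,} parameters")
print(f"ratio:    {pairwise_total / cpg_total:.1f}x")
```

Under these assumptions the pairwise setup needs roughly ten times as many parameters, and the gap grows quadratically with the number of languages.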

We present results from experiments on two datasets in our paper, but here we highlight some of the most interesting ones. In the following table we compare CPG to the pairwise NMT approach, as well as the universal one (using Google’s multilingual NMT system). Below are results for the IWSLT-15 dataset, a small dataset that is commonly used in the NMT community.

| Language pair | Pairwise | Universal | CPG |
|---|---|---|---|
| En-Cs | 14.89 | 15.92 | 17.22 |
| Cs-En | 24.43 | 25.25 | 27.37 |
| En-De | 25.99 | 25.92 | 26.77 |
| De-En | 30.93 | 29.60 | 31.77 |
| En-Fr | 38.25 | 34.40 | 38.32 |
| Fr-En | 37.40 | 35.14 | 37.89 |
| En-Th | 23.62 | 22.22 | 26.33 |
| Th-En | 15.54 | 14.03 | 26.77 |
| En-Vi | 27.47 | 25.54 | 29.03 |
| Vi-En | 24.03 | 23.19 | 26.38 |
| Mean | 26.26 | 25.12 | 27.80 |

The numbers shown in this table represent BLEU scores, the most widely used metric for evaluating MT systems. They represent a measure of precision over \(n\)-grams. More specifically, in this instance they represent a measure of how frequently the MT prediction outputs 4 consecutive words that match 4 consecutive words in the reference translation.
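As a rough illustration of the 4-gram matching described above, here is a sketch of clipped \(n\)-gram precision for a single sentence pair. It is not the full BLEU metric, which combines precisions for \(n = 1, \dots, 4\) and applies a brevity penalty over a whole corpus, and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(hypothesis, reference, n=4):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference (the "clipping" used by BLEU).
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()

precision_4 = clipped_ngram_precision(hypothesis, reference, n=4)
precision_1 = clipped_ngram_precision(hypothesis, reference, n=1)
```

A single wrong word breaks every 4-gram that contains it, which is why the 4-gram precision here is much lower than the unigram precision.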

We show similar results for the IWSLT-17 dataset (commonly used challenge dataset for multilingual NMT systems that includes a zero-shot setting) in our paper, where CPG outperforms both the pairwise approach and the universal approach (i.e., Google’s system). Most interestingly, we computed the cosine distance between all pairs of language embeddings learned by CPG:

There are some interesting patterns that indicate that the learned language embeddings are reasonable. For example, we observe that German (*De*) and Dutch (*Nl*) are most similar for the IWSLT-17 dataset, with Italian (*It*) and Romanian (*Ro*) coming second. Furthermore, Romanian and German are the furthest apart for that dataset. It is very encouraging to see that these relationships agree with linguistic knowledge about these languages and the families they belong to. We see similar patterns in the IWSLT-15 results but we focus on IWSLT-17 here, because it is a larger, better quality dataset with more supervised language pairs. These results also surface relationships between languages that may be less expected. For example, perhaps surprisingly, French (*Fr*) and Vietnamese (*Vi*) appear to be significantly related in the IWSLT-15 results. This is likely a result of French influence on Vietnamese during the occupation of Vietnam by France in the 19th and 20th centuries.
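For reference, the cosine distance between two language embeddings can be computed as in the sketch below. The three-dimensional vectors are made-up placeholders and not the embeddings learned by CPG; they are only chosen so that the ranking of pairs is easy to follow.

```python
import itertools
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Placeholder embeddings, NOT the learned values from our experiments.
embeddings = {
    "De": [1.0, 0.2, 0.0],
    "Nl": [0.9, 0.3, 0.1],
    "Ro": [-0.2, 1.0, 0.5],
}

# Distances for all unordered language pairs, sorted from closest to furthest.
pairs = sorted(
    (cosine_distance(embeddings[a], embeddings[b]), a, b)
    for a, b in itertools.combinations(sorted(embeddings), 2)
)
closest = pairs[0]
furthest = pairs[-1]
```

Cosine distance ignores the length of the embeddings and only compares their directions, which makes it a convenient choice when embedding norms are not meaningful on their own.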

More details can be found in our EMNLP 2018 paper. We have also released an implementation of our approach and experiments as part of a new Scala framework for machine translation (we plan to soon push a large update to the framework that makes reproducing our results and playing around with CPG, and NMT more generally, much easier). It is built on top of TensorFlow Scala and follows a modular NMT design that supports various NMT models, including our baselines. It also contains data loading and preprocessing pipelines that support multiple datasets and languages, and is more efficient than other packages (e.g., tf-nmt). Furthermore, the framework supports various vocabularies, among which we provide a new implementation for the byte-pair encoding (BPE) algorithm that is 2 to 3 orders of magnitude faster than the released one.

We also plan to release another blog post later on in 2019 with some follow-up work we have done on applying contextual parameter generation to other problems.

We would like to thank Otilia Stretcu, Abulhair Saparov, and Maruan Al-Shedivat for the useful feedback they provided on early versions of this paper. This research was supported in part by AFOSR under grant FA95501710218.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. *Contextual Parameter Generation for Universal Neural Machine Translation.* In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. *Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism.* In Proceedings of NAACL-HLT, pages 866–875.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. *Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder.* In Proceedings of the 13th International Workshop on Spoken Language Translation.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, et al. 2017. *Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation*. In Transactions of the Association for Computational Linguistics, volume 5, pages 339–351.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. *Multi-task Sequence to Sequence Learning*. In International Conference on Learning Representations.

Machine learning algorithms typically have configuration parameters, or hyperparameters, that influence their output and ultimately predictive accuracy (Melis et al., 2018). Some common examples of hyperparameters include learning rate, dropout, and activation function for neural networks, maximum tree depth for random forests, and regularization rate for regularized linear regression.

In practice, applying machine learning solutions requires carefully tuning the hyperparameters of the model in order to achieve high predictive accuracy. Certain problems, like reinforcement learning and GAN training, are notoriously difficult and highly sensitive to hyperparameters (Jaderberg et al., 2017; Roth et al., 2017). In other instances, hyperparameter tuning can drastically improve the performance of a model. For example, a carefully tuned LSTM language model beat many recently proposed recurrent architectures that claimed to be state of the art (Melis et al., 2018; Merity et al., 2017).

Practitioners often tune these hyperparameters manually (i.e., graduate student descent) or default to brute-force methods like systematically searching a grid of hyperparameters (grid search) or randomly sampling hyperparameters (random search), both of which are depicted in Figure 1. In response, the field of hyperparameter optimization addresses the important problem of automating the search for a good hyperparameter configuration quickly and efficiently.
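The two brute-force baselines can be sketched in a few lines. The search space below (learning rate and dropout) and the helper names `grid_search` and `random_search` are assumptions chosen for illustration, and the snippet only generates candidate configurations without training any model.

```python
import itertools
import random


def grid_search(space):
    """Enumerate every combination of the listed values for each hyperparameter."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(space[name] for name in names))
    ]


def random_search(samplers, num_samples, seed=0):
    """Draw each hyperparameter independently from its own sampler."""
    rng = random.Random(seed)
    return [
        {name: sample(rng) for name, sample in samplers.items()}
        for _ in range(num_samples)
    ]


# Grid search: 3 x 3 = 9 configurations on fixed values.
grid = grid_search({
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout": [0.0, 0.25, 0.5],
})

# Random search: 9 configurations drawn from continuous ranges
# (the learning rate is sampled log-uniformly).
samples = random_search(
    {
        "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),
        "dropout": lambda rng: rng.uniform(0.0, 0.5),
    },
    num_samples=9,
)
```

With the same budget of 9 configurations, grid search tries only 3 distinct values per hyperparameter, while random search tries 9, which matters when only a few hyperparameters strongly affect performance.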

In the current era of machine learning, learning algorithms often contain half-a-dozen hyperparameters (and easily more) and training a single model can take days or weeks rather than minutes or hours. It is simply not feasible to train several models sequentially and wait days, weeks or months to finally choose a model to deploy. In fact, a model may need to be selected in roughly the same wall-clock time needed to train and evaluate only a single hyperparameter configuration. In such a setting, we need to exploit parallelism to have any hope of finding a good configuration in a reasonable time. Luckily, the increased prevalence of cloud computing provides easy access to distributed computing resources and scaling up hyperparameter search to more machines *is* feasible. We argue that tuning computationally heavy models using massive parallelism is the new paradigm for hyperparameter optimization.

Many existing methods for hyperparameter optimization use information from previously trained configurations to inform which hyperparameters to train next (see Figure 1, right). This approach makes sense when models take minutes or hours to train: waiting a few rounds to learn which hyperparameter configurations are more likely to succeed still allows a reasonable window for feedback and iteration. If a machine learning problem fits into this paradigm, using these sequential adaptive hyperparameter selection approaches can provide significant speedups over random search and grid search. However, these methods are difficult to parallelize and generally do not scale well with the number of workers.

In the parallel setting, practitioners default to using random and grid search for hyperparameter optimization because the two methods are trivial to parallelize and easily scale to any number of machines. However, both methods are brute-force approaches that scale poorly with the number of hyperparameters. The challenge going forward is how to tackle increasingly complex hyperparameter optimization tasks with higher dimensions that push the limits of our distributed resources. We propose addressing this challenge with an asynchronous early-stopping approach based on the successive halving algorithm.

Recently we proposed an algorithm that uses the successive halving algorithm (SHA), a well-known multi-armed bandit algorithm, to perform principled early stopping. The successive halving algorithm begins with all candidate configurations in the base rung and proceeds as follows:

- Uniformly allocate a budget to a set of candidate hyperparameter configurations in a given rung.
- Evaluate the performance of all candidate configurations.
- Promote the top half of candidate configurations to the next rung.
- Double the budget per configuration for the next rung and repeat until one configuration remains.

The algorithm can be generalized to allow for a variable rate of elimination η so that only 1/η of configurations are promoted to the next rung. Hence, higher η indicates a more aggressive rate of elimination where all but the top 1/η of configurations are eliminated.
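The steps above, including the elimination rate η, can be sketched as follows. This is a minimal sequential version under stated assumptions: `evaluate(config, budget)` is a hypothetical callback that trains a configuration with the given budget and returns a loss (lower is better), and the toy objective at the bottom is made up so that the example runs without training anything.

```python
import math


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Return the configuration that survives all rungs.

    In every rung, each remaining configuration is evaluated with the same
    budget, the top 1/eta are promoted, and the budget is multiplied by eta.
    """
    rung = list(configs)
    budget = min_budget
    while True:
        # Rank all configurations in the current rung (lowest loss first).
        ranked = sorted(rung, key=lambda config: evaluate(config, budget))
        if len(ranked) == 1:
            return ranked[0]
        # Promote the top 1/eta and give the survivors eta times more budget.
        rung = ranked[: max(1, len(ranked) // eta)]
        budget *= eta


def toy_loss(learning_rate, budget):
    """Made-up objective: best at a learning rate of 0.01, improves with budget."""
    return abs(math.log10(learning_rate) + 2.0) + 1.0 / budget


candidates = [1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003]
best = successive_halving(candidates, toy_loss, min_budget=1, eta=2)
```

A practical implementation would resume training from checkpoints when a configuration is promoted instead of re-evaluating it from scratch, which this sketch leaves out.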

To demonstrate SHA, consider the problem of tuning a grid of 3 hyperparameters for a 2-layer neural network: learning rate (0.1, 0.01, 0.001), momentum (0.85, 0.9, 0.95), and weight decay (0.01, 0.001, 0.0001). This allows for a total of 27 different hyperparameter configurations. In the table below, we show the rungs for SHA run with 27 configurations, a minimum resource per configuration of 1 epoch, and an elimination rate of 3. The synchronized promotions of SHA according to the schedule shown in the table are animated in Figure 2.

| | Configurations Remaining | Epochs per Configuration |
|---|---|---|
| Rung 1 | 27 | 1 |
| Rung 2 | 9 | 3 |
| Rung 3 | 3 | 9 |
| Rung 4 | 1 | 27 |

**Table 1:** SHA with η=3 starting with 27 configurations, each allocated a resource of 1 epoch in the first rung.
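The schedule in Table 1 follows directly from the number of starting configurations, the minimum resource, and η. The helper `sha_schedule` below is a hypothetical name used for this sketch; it also shows that every rung consumes the same total number of epochs.

```python
def sha_schedule(num_configs, min_resource, eta):
    """Return a list of (configurations remaining, resource per configuration)."""
    rungs = []
    remaining, resource = num_configs, min_resource
    while remaining >= 1:
        rungs.append((remaining, resource))
        if remaining == 1:
            break
        remaining //= eta  # keep the top 1/eta of configurations
        resource *= eta    # give each survivor eta times more resource
    return rungs


table_1 = sha_schedule(num_configs=27, min_resource=1, eta=3)
epochs_per_rung = [configs * epochs for configs, epochs in table_1]
total_epochs = sum(epochs_per_rung)

for index, (configs, epochs) in enumerate(table_1, start=1):
    print(f"Rung {index}: {configs} configurations, {epochs} epochs each")
```

Training all 27 configurations for the full 27 epochs would cost 729 epochs, whereas this schedule spends 27 epochs per rung.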

In the sequential setting, successive halving evaluates orders of magnitude more hyperparameter configurations than random search by adaptively allocating resources to promising configurations. Unfortunately, it is difficult to parallelize because the algorithm takes a set of configurations as input and waits for all configurations in a rung to complete before promoting configurations to the next rung.

To remove the bottleneck created by synchronous promotions, we tweak the successive halving algorithm to grow from the bottom up and promote configurations whenever possible instead of starting with a wide set of configurations and narrowing down. We call this the Asynchronous Successive Halving Algorithm (ASHA).

ASHA begins by assigning workers to add configurations to the bottom rung. When a worker finishes a job and requests a new one, we look at the rungs from top to bottom to see if there are configurations in the top 1/η of each rung that can be promoted to the next rung. If not, we assign the worker to add a configuration to the lowest rung, growing the width of that rung so that more configurations can be promoted.
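The promotion rule described above can be sketched as a small scheduler. This is a simplified illustration under stated assumptions, not our released implementation: the class and method names are hypothetical, losses are lower-is-better, and the loop at the bottom simulates a single worker with a made-up loss instead of real training.

```python
import random


class ASHA:
    """Minimal asynchronous successive halving scheduler (illustrative sketch)."""

    def __init__(self, sample_config, min_resource=1, eta=3, max_rung=3):
        self.sample_config = sample_config
        self.min_resource = min_resource
        self.eta = eta
        self.max_rung = max_rung
        # rungs[k] holds (loss, config_id) for configurations that finished rung k.
        self.rungs = [[] for _ in range(max_rung + 1)]
        # promoted[k] holds the ids already promoted out of rung k.
        self.promoted = [set() for _ in range(max_rung + 1)]
        self.configs = {}

    def get_job(self):
        """Return (config_id, config, rung, resource) for a free worker."""
        # Look for a promotable configuration, scanning rungs from top to bottom.
        for k in reversed(range(self.max_rung)):
            finished = sorted(self.rungs[k])
            top = finished[: len(finished) // self.eta]
            for _, config_id in top:
                if config_id not in self.promoted[k]:
                    self.promoted[k].add(config_id)
                    resource = self.min_resource * self.eta ** (k + 1)
                    return config_id, self.configs[config_id], k + 1, resource
        # Nothing to promote: grow the bottom rung with a new configuration.
        config_id = len(self.configs)
        self.configs[config_id] = self.sample_config()
        return config_id, self.configs[config_id], 0, self.min_resource

    def report(self, config_id, rung, loss):
        """Record the loss of a finished job."""
        self.rungs[rung].append((loss, config_id))


# Simulate one worker. The "loss" of a configuration is simply its sampled
# value, a stand-in for actually training a model.
rng = random.Random(0)
asha = ASHA(sample_config=lambda: rng.uniform(0.0, 1.0), eta=3, max_rung=2)
jobs = []
for _ in range(13):
    config_id, config, rung, resource = asha.get_job()
    jobs.append((config_id, rung, resource))
    asha.report(config_id, rung, loss=config)
```

With several workers, each worker would call `get_job` whenever it becomes free and `report` when its job finishes, so no worker ever has to wait for a rung to fill up.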

Figures 2 and 3 animate the promotion schemes for synchronous and asynchronous successive halving, respectively, when using 10 workers, along with the worker efficiency of each. As shown in Figure 2, the naive way of parallelizing SHA, where each configuration in a rung is distributed across workers, diminishes in efficiency as the number of jobs dwindles for higher rungs. In contrast, asynchronous SHA approaches 100% resource efficiency, as workers are always able to stay busy by expanding the base rung if no configurations can be promoted to higher rungs.

**Figure 2:** Successive halving with synchronous promotions.

**Figure 3:** Successive halving with asynchronous promotions.
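A back-of-the-envelope calculation illustrates why synchronous promotions waste workers. The sketch below assumes a simplified model that is not taken from our experiments: every job within a rung takes the same time, each worker runs one job at a time, and all workers wait until the rung is complete. It uses the rung sizes from Table 1 with 10 workers.

```python
import math


def rung_utilization(num_jobs, num_workers):
    """Fraction of worker time spent on useful work while one rung is processed."""
    rounds = math.ceil(num_jobs / num_workers)
    return num_jobs / (rounds * num_workers)


NUM_WORKERS = 10
rung_sizes = (27, 9, 3, 1)  # configurations remaining in each rung of Table 1
utilization = [rung_utilization(jobs, NUM_WORKERS) for jobs in rung_sizes]

for index, (jobs, busy) in enumerate(zip(rung_sizes, utilization), start=1):
    print(f"Rung {index}: {jobs} jobs, {busy:.0%} of worker time used")
```

Under this model the last two rungs, which are also the longest running ones, keep at most 3 of the 10 workers busy.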

In our first set of experiments, we compare ASHA to SHA and PBT on two benchmark tasks on CIFAR-10: (1) tuning a convolutional neural network (CNN) with the cuda-convnet architecture and the same search space as Li et al. (2017); and (2) tuning a CNN architecture with varying number of layers, batch size, and number of filters. PBT is a state-of-the-art evolutionary method that iteratively improves the fitness of a population of configurations after partially training the current population. For a more extensive comparison of ASHA to additional state-of-the-art hyperparameter optimization methods, please take a look at our full paper.

The resources allocated to the rungs as a fraction of the maximum resource per model R by SHA and ASHA are shown in Table 2; the number of remaining configurations in synchronous SHA is shown as well. Note that we use an elimination rate of η=4 so that only the top ¼ of configurations are promoted to the next rung.

| | Configurations Remaining | Epochs per Configuration |
|---|---|---|
| Rung 1 | 256 | R/256 |
| Rung 2 | 64 | R/64 |
| Rung 3 | 16 | R/16 |
| Rung 4 | 4 | R/4 |
| Rung 5 | 1 | R |

**Table 2:** SHA with η=4 starting with 256 configurations, each allocated a resource of R/256 in the first rung. ASHA simply allocates the indicated resource to configurations in each rung and promotes configurations in the top 1/4th to the rung above.

We compare the methods in both single machine and distributed settings. Figure 4 shows the performance of each search method on a single machine. Our results show that SHA and ASHA outperform PBT on the first benchmark and all three methods perform comparably in the second. Note that ASHA achieves comparable performance to SHA on both benchmarks despite promoting configurations asynchronously.

**Figure 4:** Comparison of hyperparameter optimization methods on 2 benchmark tasks using a single machine. Average across 10 trials is shown with dashed lines representing top and bottom quintiles.

As shown in Figure 5, the story is similar in the distributed setting with 25 workers. For benchmark 1, ASHA evaluated over 1000 configurations in just over 40 minutes with 25 workers (compared to 25 for random search) and found a good configuration (error rate below 0.21) in approximately the time needed to train a single model, whereas it took ASHA nearly 400 minutes to do so in the sequential setting (Figure 4). Notably, we only achieve a 10× speedup on 25 workers due to the relative simplicity of this task, i.e., it only required evaluating a few hundred configurations before identifying a good one in the sequential setting. In contrast, when considering the more difficult search space in benchmark 2, we observe linear speedups with ASHA, as the roughly 700 minutes in the sequential setting (Figure 4) needed to find a configuration with test error below 0.23 is reduced to under 25 minutes in the distributed setting.

We further note that ASHA outperforms PBT on benchmark 1; in fact the minimum and maximum range for ASHA across 5 trials does not overlap with the average for PBT. On benchmark 2, PBT slightly outperforms asynchronous Hyperband and performs comparably to ASHA. However, note that the ranges for the searchers share large overlap and the result is likely not significant. ASHA’s slight outperformance of PBT on these two tasks, coupled with the fact that it is a more principled and general approach (e.g., agnostic to resource type and robust to hyperparameters that change the size of the model), further motivates its use for distributed hyperparameter optimization.

**Figure 5:** Comparison of hyperparameter optimization methods on 2 benchmarks using 25 machines. Average across 5 trials is shown with dashed lines representing min/max ranges. Black vertical dotted lines represent maximum time needed to train a configuration in the search space. Blue vertical dotted line indicates time given to each searcher in the single machine experiments in Figure 4.

We tune a one-layer LSTM language model for next word prediction on the Penn Treebank (PTB) dataset. Each tuner is given 500 workers to tune 9 different hyperparameters that control the optimization routine and the model architecture. We evaluate the performance of ASHA and compare it to the default tuning method in Vizier, Google’s internal hyperparameter optimization service, with and without early stopping.

Our results in Figure 6 show that ASHA is 3x faster than Vizier at finding a good configuration; namely, ASHA is able to find a configuration with perplexity below 80 in the time it takes to train an average configuration for R resource (this search space contains hyperparameters that affect the training time like number of hidden units and batch size), compared to the nearly 3R needed by Vizier.

**Figure 6:** Large-scale ASHA benchmark that takes on the order of weeks to run with 500 workers. The x-axis is measured in units of average time to train a single configuration for R resource. The average across 5 trials is shown, with dashed lines indicating min/max ranges.

Notably, we observe that certain hyperparameter configurations in this benchmark induce perplexities that are orders of magnitude larger than the average case perplexity. Model-based methods that make assumptions on the data distribution, such as Vizier, can degrade in performance without further care to adjust this signal. We attempted to alleviate this by capping perplexity scores at 1000 but this still significantly hampered the performance of Vizier. We view robustness to these types of scenarios as an additional benefit of ASHA and Hyperband.

For more details about the successive halving algorithm and how we guarantee that the algorithm will not eliminate good configurations prematurely, see our original paper for the sequential setting. Finally, we refer interested readers to this paper for more details about the asynchronous version of the successive halving algorithm, an extended related work section, and more extensive empirical studies.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of CMU.

The blog aims to provide a general-audience medium for the CMU community to share cutting-edge research findings as well as perspectives on the field of machine learning, with easily digestible material that is both accessible and informative to readers with a wide range of expertise. Posts are written by students, postdocs, and faculty throughout all of CMU, with blog content curated by a student-led editorial board. Our five inaugural posts over the next month will be written and edited by the editorial board, and moving forward, posts on a variety of machine learning topics will appear approximately bi-weekly.

Check out the about page for more information and the submissions page for contribution guidelines. Look for our first few posts in the coming weeks. We are looking forward to sharing our work with you!
