Figure 1: The plot on the left shows that increasing the amount of regularization used when running ridge regression (corresponding to moving up higher on the y-axis) causes ridge to tune out the directions in the data that have lower variation (i.e., we see more dark green as we scan from right to left). Meanwhile, the plot on the right shows very similar behavior, but this time for a very different estimator: gradient descent run on the least-squares loss, as we terminate it earlier and earlier (i.e., as we increasingly stop gradient descent far short of convergence, again given by moving higher up on the y-axis). This raises the question: are these similarities just a coincidence? Or is there something deeper going on here, linking early-stopped gradient descent to ridge regression? We’ll try to address this question in this blog post. (By the way, many of the other details in the plot, e.g., the labels “Grad Flow”, “lambda”, and “1/t”, will all be explained later on in the post.)

These days, it seems as though tools from statistics and machine learning are being used almost everywhere. Methods from the literature can now be seen on the critical path in a number of different fields and industries. A consequence of this sort of proliferation is that, increasingly, both non-specialists and specialists alike are being asked to deploy statistical models into the wild. This often makes “simple” methods a natural choice, with “simple” usually meaning (a) “easy-to-implement” and/or (b) “computationally cheap”.

That being the case, what could be simpler than gradient descent? Gradient descent is often quite easy to implement and computationally affordable, which probably explains (at least some of) its popularity. One thing people tend to do with gradient descent is to “stop it early”, by which I mean: people tend to *not* run gradient descent until convergence. Why not? Well, it has been long observed (by many, e.g., here, here, and here) that early-stopping gradient descent has a kind of regularizing effect—even if the loss function that gradient descent is run on has no *explicit* regularizer; see the two figures above, for a bit more of this kind of motivation. In fact, it has been suggested (see, e.g., here and here) that the *implicit regularization* properties of optimization algorithms may explain at least in part some of the recent successes of deep neural networks in practice, making implicit regularization a very lively and growing area of research right now.

In our recent paper, we precisely characterize the implicit regularization effect of early-stopping gradient descent, when run on the least-squares regression problem. Apart from being interesting in its own right, the hope is that this sort of characterization will also be useful to practitioners.

(By the way, in case you think that least-squares is an elementary and boring problem: I would counter that and say that least-squares turns out to actually be quite an interesting problem to study, and moreover it represents a good starting point before moving onto more exotic problems. In any event, I’ll mention ideas for future work at the end of this post!)

To fix ideas, here is the standard least-squares regression problem:

\begin{equation}

\tag{1}

\hat \beta^{\textrm{ls}} \in \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; \frac{1}{2n} \|y-X\beta\|_2^2.

\end{equation}

To be clear, \(y \in \mathbb{R}^n\) is the response, \(X \in \mathbb{R}^{n \times p}\) is the data matrix, \(n\) denotes the number of samples, and \(p\) the number of features. Running gradient descent on the problem (1) just amounts to

\begin{equation}

\tag{2}

\beta^{(k)} = \beta^{(k-1)} + \epsilon \cdot \frac{X^T}{n} (y - X \beta^{(k-1)}),

\end{equation}

where \(k=1,2,3,\ldots\) is an iteration counter, and \(\epsilon > 0\) is a fixed step size. At this point, it helps to also consider ridge regression:

\begin{equation}

\tag{3}

\hat \beta^{\textrm{ridge}}(\lambda) = \underset{\beta \in \mathbb{R}^p}{\mathrm{argmin}} \; \frac{1}{n} \|y-X\beta\|_2^2 + \lambda

\|\beta\|_2^2,

\end{equation}

where \(\lambda \geq 0\) is a tuning parameter.
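To make this setup concrete, here is a minimal NumPy sketch (with simulated data; the sizes, noise level, and step size below are arbitrary choices of mine, not from the paper) of the gradient descent iteration (2) alongside the closed-form ridge solution of (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def grad_descent(X, y, eps=0.1, k=5000):
    """Iteration (2): beta <- beta + eps * X^T (y - X beta) / n, from beta^(0) = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(k):
        beta = beta + eps * X.T @ (y - X @ beta) / X.shape[0]
    return beta

def ridge(X, y, lam):
    """Closed-form solution of (3): (X^T X / n + lam * I)^{-1} X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```

With these two functions in hand, one can trace out the full gradient descent path (over iterations \(k\)) and the full ridge path (over \(\lambda\)) and compare them.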

It might already be intuitively obvious to you that (a) running (2) until convergence is equivalent to running (3) with \(\lambda = 0\); and (b) assuming the initialization \(\beta^{(0)} = 0\), not taking any steps in (2) just yields the null model, equivalent to running (3) with \(\lambda \to \infty\). Put differently, it seems as though running gradient descent for *longer* corresponds to applying *less* regularization in the ridge problem (3), i.e., to a smaller value of \(\lambda\).

Here, we are plotting the *estimation risk* (defined next) of ridge regression and (essentially) gradient descent on the y-axis, vs. the ridge regularization strength \(\lambda = 1/t\) (or, equivalently, the inverse of the number of gradient descent iterations) on the x-axis; the data was generated by drawing samples from a normal distribution, but similar results hold for other distributions as well. I’ll explain the reason for the “essentially” in just a minute.

First of all, by the *estimation risk* of an estimator \(\hat \beta\), I simply mean

\begin{equation}

\mathrm{Risk}(\hat \beta;\beta_0) := \mathbb{E} \| \hat \beta - \beta_0 \|_2^2,

\end{equation}

where here we are taking an expectation over the randomness in the response \(y\), which is assumed to follow a (parametric) distribution with mean \(X \beta_0\) and covariance \(\sigma^2 I\), where \(\sigma^2 > 0\). For now, we fix both the underlying coefficients \(\beta_0 \in \mathbb{R}^p\) as well as the data \(X\); we will allow both to be random, later on.
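If you want a feel for this quantity, here is a small Monte Carlo sketch (my own illustration; Gaussian noise is assumed for convenience, and the sizes are arbitrary) that estimates the estimation risk of ridge regression by averaging over draws of \(y\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 10, 1.0
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X / n + lam * I)^{-1} X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def mc_risk(estimator, reps=500):
    """Monte Carlo estimate of E || beta_hat - beta0 ||_2^2 over y ~ (X beta0, sigma^2 I)."""
    errs = [np.sum((estimator(X @ beta0 + sigma * rng.standard_normal(n)) - beta0) ** 2)
            for _ in range(reps)]
    return float(np.mean(errs))

risk = mc_risk(lambda y: ridge(X, y, lam=0.1))
```

(The Monte Carlo estimate can be checked against the exact eigenvalue-based risk formula that appears in the proof sketch later in the post.)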

Back to the plot: in solid black and red, we plot the estimation risk of ridge regression and gradient descent, respectively, while we plot their limiting (i.e., large \(n,p\)) estimation risks in dashed black and red. What should be clear for now is that, aside from agreeing at their endpoints, the black and red curves are *very* close everywhere else, as well. In fact, it appears as though the risk of gradient descent cannot be more than, say, 1.2 times that of ridge regression.

Actually, maybe a second picture can also be of some help here. Below, we plot the same risks as we did above, but now the x-axis is an estimator’s achieved model complexity, as measured by its \(\ell_2\)-norm; the point of this second plot is that the risk curves are now seen to be **virtually identical**. So, hopefully that reinforces the point from earlier. Unfortunately, studying this sort of setup turns out to be a little complicated, so we’ll stick to the setup corresponding to the first plot, for the rest of this post.

The reason for the “essentially” parenthetical above was that in our paper we actually study gradient descent *with infinitesimally small step sizes*, which is often called *gradient flow* (this is why “Grad Flow” appears in the legend of the plot, above). In contrast to the gradient descent iteration (2), gradient flow for (1) can be expressed as

\begin{equation}

\tag{4}

\hat \beta^{\textrm{gf}}(t) = (X^T X)^+ (I - \exp(-t X^T X/n)) X^T y,

\end{equation}

where \(A^+\) denotes the Moore-Penrose pseudo-inverse, \(\exp(A)\) denotes the matrix exponential, and \(t\) denotes “time”. This turns out to be the key to the analysis (and standard tools from numerical analysis can connect results obtained for (4) to those for (2)). Here is a simplified version of one result from our paper (i.e., Theorem 1) that follows rather immediately after taking the continuous-time viewpoint.
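As a quick sanity check on (4), here is a sketch (my own illustration, on arbitrary simulated data) that evaluates the gradient flow estimator in the eigenbasis of \(X^T X\); running the iteration (2) with a small step size \(\epsilon\) for \(k\) steps should closely track \(\hat \beta^{\textrm{gf}}(t)\) at time \(t = k \epsilon\):

```python
import numpy as np

def grad_flow(X, y, t):
    """The gradient flow estimator (4), computed in the eigenbasis of X^T X."""
    n = X.shape[0]
    d, V = np.linalg.eigh(X.T @ X)  # eigenvalues d_i >= 0 of X^T X
    with np.errstate(divide="ignore", invalid="ignore"):
        # (X^T X)^+ (I - exp(-t X^T X / n)) acts on eigenvalues as (1 - exp(-t d/n)) / d
        g = np.where(d > 1e-10, (1.0 - np.exp(-t * d / n)) / d, 0.0)
    return V @ (g * (V.T @ (X.T @ y)))

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 4))
y = X @ rng.standard_normal(4) + 0.1 * rng.standard_normal(40)
beta_t = grad_flow(X, y, t=5.0)  # the gradient flow path, evaluated at time t = 5
```

As \(t \to \infty\), the exponential vanishes and the expression reduces to the least-squares solution, matching endpoint (a) from earlier.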

**Theorem (simplified).** Under the conditions given above, for all underlying coefficients \(\beta_0 \in \mathbb{R}^p\) and times \(t \geq 0\), we have that \(\mathrm{Risk}(\hat \beta^{\textrm{gf}}(t);\beta_0) \leq 1.6862 \cdot \mathrm{Risk}(\hat \beta^{\textrm{ridge}}(1/t);\beta_0)\).

In words, the estimation risk of gradient flow is no more than 1.6862 times that of ridge, at any point along their paths, provided we “line them up” by taking the ridge regularization strength to be \(\lambda = 1/t\) (which seems fairly natural, as discussed above). The constant 1.6862 is also consistent with the ratio of roughly 1.2 that we observed empirically above (being a worst-case bound, it is naturally somewhat larger).

To be completely clear, we are certainly not the first authors to relate gradient descent and \(\ell_2\)-regularization (see our paper for a mention of related work), but we are not aware of any other kind of result that does so with this level of specificity as well as generality; more on this latter aspect after a proof of the result.

**Proof (sketch).** To show the result, we need two simple facts: first, for \(x \geq 0\), it holds that (a) \(\exp(-x) \leq 1/(1+x)\); and second (b) \(1-\exp(-x) \leq 1.2985 \cdot x/(1+x)\). We also need a little bit of notation: let \(s_i, v_i, \; i=1,\ldots,p\) denote the eigenvalues and eigenvectors, respectively, of \((1/n) X^T X\). Owing to the continuous-time representation, it turns out the estimation risk of gradient flow can be written as \(\mathrm{Risk}(\hat \beta^{\textrm{gf}}(t);\beta_0) = \sum_{i=1}^p a_i\), where

\begin{equation*}

a_i = |v_i^T \beta_0|^2 \exp(-2 t s_i) +

\frac{\sigma^2}{n} \frac{(1 - \exp(-t s_i))^2}{s_i},

\end{equation*}

while that of ridge regression can be written as \(\mathrm{Risk}(\hat \beta^{\textrm{ridge}}(\lambda);\beta_0) = \sum_{i=1}^p b_i\), where

\begin{equation*}

b_i = |v_i^T \beta_0|^2 \frac{\lambda^2}{(s_i + \lambda)^2} +

\frac{\sigma^2}{n} \frac{s_i}{(s_i + \lambda)^2}.

\end{equation*}

Now, using facts (a) and (b) (as well as a change of variables), we get that

\begin{align*}

a_i &\leq |v_i^T \beta_0|^2 \frac{1}{(1 + t s_i)^2} +

\frac{\sigma^2}{n} 1.2985^2 \frac{t^2 s_i}{(1 + t s_i)^2} \\

&\leq 1.6862 \bigg(|v_i^T \beta_0|^2 \frac{(1/t)^2}{(1/t + s_i)^2} +

\frac{\sigma^2}{n} \frac{s_i}{(1/t + s_i)^2} \bigg) \\

&= 1.6862 \, b_i,

\end{align*}

where, again, we used \(\lambda = 1/t\). Summing over \(i=1,\ldots,p\) in the above gives the result.
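Facts (a) and (b), and the resulting constant \(1.2985^2 \leq 1.6862\), are easy to check numerically on a grid (a sanity check, of course, not a proof):

```python
import numpy as np

x = np.linspace(0.0, 50.0, 200001)[1:]  # a fine grid of x > 0

# fact (a): exp(-x) <= 1 / (1 + x)
assert np.all(np.exp(-x) <= 1.0 / (1.0 + x))

# fact (b): 1 - exp(-x) <= 1.2985 * x / (1 + x)
assert np.all(1.0 - np.exp(-x) <= 1.2985 * x / (1.0 + x))

# and the constant in the theorem: 1.2985^2 <= 1.6862
assert 1.2985 ** 2 <= 1.6862
```

(The supremum of \((1-e^{-x})(1+x)/x\) over \(x > 0\) is attained near \(x \approx 1.8\), which is where the 1.2985 comes from.)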

**Going (much) further**

Earlier, I hinted that it was possible to generalize the result presented above. In fact, in our paper, we show how to obtain the same result for various other notions of risk:

- Bayes estimation risk (i.e., where \(\beta_0\) is assumed to follow some prior)
- In-sample (Bayes) prediction risk (i.e., \( (1/n) \mathbb{E} \| X \hat \beta - X \beta_0 \|_2^2 \))
- Out-of-sample Bayes prediction risk (i.e., \( \mathbb{E}[(x_0^T \hat \beta - x_0^T \beta_0)^2] \), where \(x_0\) is a test point).

See (the rest of) Theorem 1 as well as Theorem 2 in the paper, for details. Finally, it is also possible to obtain even tighter results in all these settings when we examine the risks of optimally-tuned ridge and gradient flow; see Theorem 3.

To finish up, I’ll (*very*) briefly mention some of the insights that we get by examining the above issues from an asymptotic viewpoint. In a really nice piece of recent work, it was shown that under a high-dimensional asymptotic setup with \(p,n \to \infty\) such that \(p/n \to \gamma \in (0,\infty)\), the out-of-sample Bayes prediction risk of ridge regression converges to the expression

\begin{equation}

\tag{5}

\sigma^2 \gamma \big[

\theta(\lambda) + \lambda (1 - \alpha_0 \lambda) \theta'(\lambda) \big],

\end{equation}

almost surely, for each \(\lambda > 0\), where \(\alpha_0\) is a fixed constant depending on the variance of the prior on the coefficients. Also, \(\theta(\lambda)\) is a functional (whose exact form is not important for this brief description) that depends on \(\sigma^2, \lambda, \gamma\), as well as the limiting distribution of the eigenvalues of the sample covariance matrix.

Where I think things get really interesting is that it turns out for gradient flow, the analogous risk has the following almost sure limit (see Theorem 6 in our paper for details):

\begin{equation}

\tag{6}

\sigma^2 \gamma \bigg[ \alpha_0 \mathcal L^{-1}(\theta)(2t) + 2 \int_0^t \big( \mathcal L^{-1}(\theta)(u) - \mathcal L^{-1}(\theta)(2u) \big) \, du\bigg],

\end{equation}

where \(\mathcal L^{-1}(\theta)\) denotes the inverse Laplace transform of \(\theta\). In a nice bit of symmetry, note that (5) features the functional \(\theta\) and its derivative, whereas (6) features the inverse Laplace transform \(\mathcal L^{-1}(\theta)\) and its antiderivative. Moreover, note that (6) is an asymptotically *exact* risk expression, i.e., there are no hidden constants or anything like that. The key to the analysis here turns out to be applying recent tools from random matrix theory to (functionals of) the sample covariance matrix \((1/n) X^T X\). Again, that was a very brief sketch of the result … but hopefully it at least got across some (more) of the interesting duality between ridge regression and gradient descent.

If you found this post interesting, that’s great! There are a number of other connections between ridge regression and gradient descent that we point out in the paper, which I didn’t have the space to get into here. There are also plenty of places to go next, including studying stochastic gradient descent and moving beyond least-squares (both of which we are doing right now). Feel free to get in touch if you would like to learn more.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of Carnegie Mellon University.

Figure 1: Overview of the contextual parameter generator that is introduced in this post. The top part of the figure shows a typical neural machine translation system (consisting of an encoder and a decoder network). The bottom part, shown in red, shows our parameter generator component.

Machine translation is the problem of translating sentences from some source language to a target language. Neural machine translation (NMT) directly models the mapping of a source language to a target language without any need for training or tuning any component of the system separately. This has led to a rapid progress in NMT and its successful adoption in many large-scale settings.

NMT systems typically consist of two main components: the *encoder* and the *decoder*. The encoder takes as input the source language sentence and generates some latent representation for it (e.g., a fixed-size vector). The decoder then takes that latent representation as input and generates a sentence in the target language. The sentence generation is often done in an autoregressive manner (i.e., words are generated one-by-one in a left-to-right manner).

It is easy to see how this kind of architecture can be used to translate from a single source language to a single target language. However, translating between arbitrary pairs of languages is not as simple. This problem is referred to as *multilingual machine translation*, and it is the problem we tackle. Currently, there exist three approaches for multilingual NMT.

Assuming that we have \(L\) languages and \(P\) trainable parameters per NMT model (these could be, for example, the weights and biases of a recurrent neural network) that translates from a single source language to a single target language, we have:

- **Pairwise:** Use a separate NMT model for each language pair. This results in \(L(L-1)\) separate models, each with \(P\) parameters. The main issue with this approach is that no information is shared between different languages, not even between models that share a source language (e.g., models translating from English into different target languages). This is especially problematic for low-resource languages, where we have very little training data available and would ideally want to leverage the high availability of data for other languages, such as English.
- **Universal:** Use a single NMT model for all language pairs. In this case, an artificial token can be added to the source sentence denoting the target language for the translation. This results in a single model with \(P\) parameters, since the whole model is shared across all languages. This approach can result in overfitting to the high-resource languages. It can also limit the expressivity of the translation model for languages that are more “different” (e.g., for Turkish, if training using Italian, Spanish, Romanian, and Turkish). Ideally, we want something in between the pairwise and universal models.
- **Per-Language Encoder/Decoder:** Use a separate encoder and a separate decoder for each language. This results in a total of \(LP\) parameters and lies somewhere between the pairwise and the universal approaches. However, no information can be shared between languages that are similar (e.g., Italian and Spanish).

At EMNLP 2018, we introduced the *contextual parameter generator (CPG)*, a new way to share information across different languages that generalizes the above three approaches, while mitigating their issues and allowing explicit control over the amount of sharing.

Let us denote the source language for a given sentence pair by \(\ell_s\) and the target language by \(\ell_t\). When using the contextual parameter generator, the parameters of the encoder are defined as \(\theta^{(enc)}\triangleq g^{(enc)}({\bf l}_s)\), for some function \(g^{(enc)}\), where \({\bf l}_s\) denotes a language embedding for the source language \(\ell_s\). Similarly, the parameters of the decoder are defined as \(\theta^{(dec)}\triangleq g^{(dec)}({\bf l}_t)\) for some function \(g^{(dec)}\), where \({\bf l}_t\) denotes a language embedding for the target language \(\ell_t\). Our general formulation does not impose any constraints on the functional form of \(g^{(enc)}\) and \(g^{(dec)}\). In this case, you can think of the source language, \(\ell_s\), as a context for the encoder. The parameters of the encoder depend on its context, but its architecture is common across all contexts. We can make a similar argument for the decoder, and that is where the name of this parameter generator comes from.

We can even go a step further and have a parameter generator that defines \(\theta^{(enc)}\triangleq g^{(enc)}({\bf l}_s, {\bf l}_t)\) and \(\theta^{(dec)}\triangleq g^{(dec)}({\bf l}_s, {\bf l}_t)\), thus coupling the encoding and decoding stages for a given language pair. This would make the model more expressive and could thus work better in cases where large amounts of training data are available. In our experiments we stick to the previous, *decoupled*, form, because it has the potential to lead to a common representation among languages, also known as an *interlingua*. An overview diagram of how the contextual parameter generator fits in the architecture of NMT models is shown in Figure 1 in the beginning of this post.

Concretely, because the encoding and decoding stages are decoupled, the encoder is not aware of the target language while generating the latent representation, and so we can take an encoded intermediate representation of a sentence and translate it to any target language. This is because the intermediate representation is independent of any target language. This makes for a stronger argument that the intermediate representation produced by our encoder could be approaching a universal interlingua, more so than methods that are aware of the target language when they perform encoding.

We refer to the functions \(g^{(enc)}\) and \(g^{(dec)}\) as *parameter generator networks*. A simple form that works well, and that we can reason about, is to define the parameter generator networks as simple linear transforms:

$$g^{(enc)}({\bf l}_s) \triangleq {\bf W^{(enc)}} {\bf l}_s,$$

$$g^{(dec)}({\bf l}_t) \triangleq {\bf W^{(dec)}} {\bf l}_t,$$

where \({\bf l}_s, {\bf l}_t \in \mathbb{R}^M\), \({\bf W^{(enc)}} \in \mathbb{R}^{P^{(enc)} \times M}\), \({\bf W^{(dec)}} \in \mathbb{R}^{P^{(dec)} \times M}\), \(M\) is the language embedding size, \(P^{(enc)}\) is the number of parameters of the encoder, and \(P^{(dec)}\) is the number of parameters of the decoder.
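As a toy illustration of the linear generator (the sizes below are made up and far smaller than in a real NMT model; the variable names are mine, not from our released code), here is what \(\theta^{(enc)} = {\bf W^{(enc)}} {\bf l}_s\) looks like in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, P_enc = 4, 10, 200  # hypothetical sizes: embedding dim, languages, encoder params

lang_emb = rng.standard_normal((L, M))   # one language embedding l per language
W_enc = rng.standard_normal((P_enc, M))  # the shared generator matrix W^(enc)

def g_enc(lang_id):
    """Linear parameter generator: theta^(enc) = W^(enc) l_s."""
    return W_enc @ lang_emb[lang_id]

theta = g_enc(0)  # flat parameter vector for language 0's encoder
```

Note that stacking the generated parameter vectors for all \(L\) languages yields a matrix of rank at most \(M\), which is exactly the low-rank view discussed next.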

Another interpretation of this model is that it imposes a low-rank constraint on the parameters. As opposed to our approach, in the base case of using multiple pairwise models to perform multilingual translation, each model has \(P = P^{(enc)} + P^{(dec)}\) learnable parameters for its encoder and decoder. Given that the models are pairwise, for \(L\) languages, we have a total of \(L(L - 1)\) learnable parameter vectors of size \(P\). On the other hand, using our contextual parameter generator, we have a total of \(L\) vectors of size \(M\) (one for each language), and a single matrix of size \(P \times M\). Then, the parameters of the encoder and the decoder, for a single language pair, are defined as a linear combination of the \(M\) columns of that matrix, as shown in the above equations. In our EMNLP 2018 paper, we consider and discuss more options for the parameter generator network that allow for controllable parameter sharing.

The contextual parameter generator is a generalization of previous approaches for multilingual NMT:

- **Pairwise:** \(g\) picks a different parameter set based on the language pair.
- **Universal:** \(g\) picks the same parameters for all languages.
- **Per-Language Encoder/Decoder:** \(g\) picks a different set of encoder/decoder parameters based on the languages.

The parameter generator also enables *semi-supervised learning*. Monolingual data can be used to train the shared encoder/decoder networks to translate a sentence from some language to itself (similar to the idea of auto-encoders). This is possible and can help learning because many of the learnable parameters are shared across languages.

Furthermore, *zero-shot translation*, where the model translates between language pairs for which it has seen no explicit training data, is also possible. This is because the same per-language parameters are used to translate to and from a given language, irrespective of the language at the other end. Therefore, as long as we train our model using some language pairs that involve a given language, it is possible to learn to translate in any direction involving that language.

Let us assume that we have trained a model using data for some set of languages, \(\ell_1, \ell_2, \dots, \ell_m\). If we obtain data for some new language \(\ell_n\), we do not have to retrain the whole model from scratch. In fact, we can fix the parameters that are shared across all languages and only learn the embedding for the new language (along with the relevant word embeddings if not using a shared vocabulary). Assuming that we had a sufficient number of languages in the beginning, this may allow us to obtain reasonable translation performance for the new language, with a minimal amount of training (due to the small number of parameters that need to be learned in this case — to put this into perspective, in most of our experiments we used language embeddings of size 8).

For the base case of using multiple pairwise models to perform multilingual translation, each model has \(P + 2WV\) parameters, where \(P = P^{(enc)} + P^{(dec)}\), \(W\) is the word embedding size, and \(V\) is the vocabulary size per language (assumed to be the same across languages, without loss of generality). Given that the models are pairwise, for \(L\) languages, we have a total of \(L(L - 1)(P + 2WV) = \mathcal{O}(L^2 P + 2 L^2 WV)\) learnable parameters. For our approach, using the linear parameter generator network, we have a total of \(\mathcal{O}(PM + LWV)\) learnable parameters. Note that the number of encoder/decoder parameters has no dependence on \(L\) now, meaning that our model can easily scale to a large number of languages.

To put these numbers into perspective, in our experiments we had \(W = 512\), \(V = 20000\), \(L = 8\), \(M = 6\), and \(P\) on the order of millions.
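Plugging these numbers in makes the gap concrete (note that \(P\) below is an assumed stand-in of 5 million, since only its order of magnitude is stated above):

```python
# Sizes from the experiments above; P is only "on the order of millions",
# so the value below is an assumed stand-in, not the exact experimental count.
W, V, L, M = 512, 20000, 8, 6
P = 5_000_000

pairwise = L * (L - 1) * (P + 2 * W * V)  # one model per ordered language pair
cpg = P * M + L * M + L * W * V           # shared W (P x M) + L embeddings + per-language vocab

print(f"pairwise: {pairwise:,}  cpg: {cpg:,}")
```

With these (assumed) values, the pairwise approach needs roughly 1.4 billion parameters versus roughly 112 million for CPG, more than a tenfold reduction.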

We present results from experiments on two datasets in our paper, but here we highlight some of the most interesting ones. In the following table, we compare CPG to the pairwise NMT models approach, as well as the universal one (using Google’s multilingual NMT system). Below are results for the IWSLT-15 dataset, a small dataset commonly used in the NMT community.

| | Pairwise | Universal | CPG |
| --- | --- | --- | --- |
| En-Cs | 14.89 | 15.92 | 17.22 |
| Cs-En | 24.43 | 25.25 | 27.37 |
| En-De | 25.99 | 25.92 | 26.77 |
| De-En | 30.93 | 29.60 | 31.77 |
| En-Fr | 38.25 | 34.40 | 38.32 |
| Fr-En | 37.40 | 35.14 | 37.89 |
| En-Th | 23.62 | 22.22 | 26.33 |
| Th-En | 15.54 | 14.03 | 26.77 |
| En-Vi | 27.47 | 25.54 | 29.03 |
| Vi-En | 24.03 | 23.19 | 26.38 |
| Mean | 26.26 | 25.12 | 27.80 |

The numbers shown in this table are BLEU scores, the most widely used metric for evaluating MT systems. BLEU is a measure of precision over \(n\)-grams: roughly, how frequently runs of up to 4 consecutive words in the MT output also appear in the reference translation.
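For intuition, here is a minimal sketch of the modified \(n\)-gram precision that underlies BLEU (real BLEU combines the precisions for \(n = 1, \dots, 4\) via a geometric mean and adds a brevity penalty; this is just the single-\(n\) building block, written by me for illustration):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=4):
    """Modified n-gram precision: the fraction of the candidate's n-grams that
    also appear in the reference, clipping each n-gram by its reference count."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
p4 = ngram_precision(cand, ref)  # only "the cat sat on" matches: 1 of 3 4-grams
```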

We show similar results for the IWSLT-17 dataset (commonly used challenge dataset for multilingual NMT systems that includes a zero-shot setting) in our paper, where CPG outperforms both the pairwise approach and the universal approach (i.e., Google’s system). Most interestingly, we computed the cosine distance between all pairs of language embeddings learned by CPG:

There are some interesting patterns that indicate that the learned language embeddings are reasonable. For example, we observe that German (*De*) and Dutch (*Nl*) are most similar for the IWSLT-17 dataset, with Italian (*It*) and Romanian (*Ro*) coming second. Furthermore, Romanian and German are the furthest apart for that dataset. It is very encouraging to see that these relationships agree with linguistic knowledge about these languages and the families they belong to. We see similar patterns in the IWSLT-15 results, but we focus on IWSLT-17 here because it is a larger, better-quality dataset with more supervised language pairs. These results also uncover relationships between languages that may have been previously unknown. For example, perhaps surprisingly, French (*Fr*) and Vietnamese (*Vi*) appear to be significantly related for the IWSLT-15 dataset results. This likely reflects the French influence on Vietnamese resulting from the occupation of Vietnam by France during the 19th and 20th centuries.

More details can be found in our EMNLP 2018 paper. We have also released an implementation of our approach and experiments as part of a new Scala framework for machine translation (we plan to soon push a large update to the framework that makes reproducing our results and playing around with CPG, and NMT more generally, much easier). It is built on top of TensorFlow Scala and follows a modular NMT design that supports various NMT models, including our baselines. It also contains data loading and preprocessing pipelines that support multiple datasets and languages, and is more efficient than other packages (e.g., tf-nmt). Furthermore, the framework supports various vocabularies, among which we provide a new implementation for the byte-pair encoding (BPE) algorithm that is 2 to 3 orders of magnitude faster than the released one.

We also plan to release another blog post later on in 2019 with some follow-up work we have done on applying contextual parameter generation to other problems.

We would like to thank Otilia Stretcu, Abulhair Saparov, and Maruan Al-Shedivat for the useful feedback they provided in early versions of this paper. This research was supported in part by AFOSR under grant FA95501710218.


Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. *Contextual Parameter Generation for Universal Neural Machine Translation.* In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. *Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism.* In Proceedings of NAACL-HLT, pages 866–875.

Thanh-Le Ha, Jan Niehues, and Alexander Waibel. 2016. *Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder.* In Proceedings of the 13th International Workshop on Spoken Language Translation.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, et al. 2017. *Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation*. In Transactions of the Association for Computational Linguistics, volume 5, pages 339–351.

Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. *Multi-task Sequence to Sequence Learning*. In International Conference on Learning Representations.

Machine learning algorithms typically have configuration parameters, or hyperparameters, that influence their output and ultimately predictive accuracy (Melis et al., 2018). Some common examples of hyperparameters include learning rate, dropout, and activation function for neural networks, maximum tree depth for random forests, and regularization rate for regularized linear regression.

In practice, applying machine learning solutions requires carefully tuning the hyperparameters pertaining to the model in order to achieve high predictive accuracy. Certain problems, like reinforcement learning and GAN training, are notoriously hard to train and highly sensitive to hyperparameters (Jaderberg et al., 2017; Roth et al., 2017). In other instances, hyperparameter tuning can drastically improve the performance of a model; e.g., carefully tuning the hyperparameters of an LSTM language model beat out many recently proposed recurrent architectures that claimed to be state-of-the-art (Melis et al., 2018; Merity et al., 2017).

Practitioners often tune these hyperparameters manually (i.e., graduate student descent) or default to brute-force methods like systematically searching a grid of hyperparameters (grid search) or randomly sampling hyperparameters (random search), both of which are depicted in Figure 1. In response, the field of hyperparameter optimization addresses the important problem of automating the search for a good hyperparameter configuration quickly and efficiently.

In the current era of machine learning, learning algorithms often contain half a dozen hyperparameters (and easily more), and training a single model can take days or weeks rather than minutes or hours. It is simply not feasible to train several models sequentially and wait days, weeks, or months to finally choose a model to deploy. In fact, a model may need to be selected in roughly the same wall-clock time needed to train and evaluate only a single hyperparameter configuration. In such a setting, we need to exploit parallelism to have any hope of finding a good configuration in a reasonable time. Luckily, the increased prevalence of cloud computing provides easy access to distributed computing resources, so scaling up hyperparameter search to more machines *is* feasible. We argue that tuning computationally heavy models using massive parallelism is the new paradigm for hyperparameter optimization.

Many existing methods for hyperparameter optimization use information from previously trained configurations to inform which hyperparameter configurations to train next (see Figure 1, right). This approach makes sense when models take minutes or hours to train: waiting a few rounds to learn which hyperparameter configurations are more likely to succeed still allows a reasonable window for feedback and iteration. If a machine learning problem fits into this paradigm, these sequential adaptive hyperparameter selection approaches can provide significant speedups over random search and grid search. However, these methods are difficult to parallelize and generally do not scale well with the number of workers.

In the parallel setting, practitioners default to random search and grid search because the two methods are trivial to parallelize and easily scale to any number of machines. However, both are brute-force approaches that scale poorly with the number of hyperparameters. The challenge going forward is how to tackle increasingly complex hyperparameter optimization tasks with higher dimensions that push the limits of our distributed resources. We propose addressing this challenge with an asynchronous early-stopping approach based on the successive halving algorithm.

We recently proposed using the successive halving algorithm (SHA), a well-known multi-armed bandit algorithm, to perform principled early stopping of hyperparameter configurations. SHA begins with all candidate configurations in the base rung and proceeds as follows:

- Uniformly allocate a budget to a set of candidate hyperparameter configurations in a given rung.
- Evaluate the performance of all candidate configurations.
- Promote the top half of candidate configurations to the next rung.
- Double the budget per configuration for the next rung and repeat until one configuration remains.

The algorithm can be generalized to allow for a variable rate of elimination η so that only 1/η of configurations are promoted to the next rung. Hence, higher η indicates a more aggressive rate of elimination where all but the top 1/η of configurations are eliminated.
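The core loop above can be sketched in a few lines. This is a minimal illustration, not our actual implementation: `evaluate(config, resource)` is a hypothetical user-supplied function that trains a configuration for the given budget and returns a loss (lower is better).

```python
def successive_halving(configs, evaluate, min_resource=1, eta=2):
    """Synchronous successive halving: evaluate all configurations on a
    small budget, keep the top 1/eta, multiply the budget by eta, and
    repeat until one configuration remains."""
    rung, resource = list(configs), min_resource
    while len(rung) > 1:
        # Evaluate every configuration in the current rung.
        scores = sorted((evaluate(c, resource), c) for c in rung)
        # Promote the top 1/eta to the next rung with eta times the budget.
        keep = max(1, len(rung) // eta)
        rung = [c for _, c in scores[:keep]]
        resource *= eta
    return rung[0]
```

With 27 configurations and η=3, for instance, the rungs shrink 27 → 9 → 3 → 1 while the per-configuration budget grows 1 → 3 → 9 → 27, matching the schedule in Table 1 below.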

To demonstrate SHA, consider the problem of tuning a grid of 3 hyperparameters for a 2-layer neural network: learning rate (0.1, 0.01, 0.001), momentum (0.85, 0.9, 0.95), and weight decay (0.01, 0.001, 0.0001). This allows for a total of 27 different hyperparameter configurations. In the table below, we show the rungs for SHA run with 27 configurations, a minimum resource per configuration of 1 epoch, and an elimination rate of η=3. The synchronized promotions of SHA according to the schedule shown in the table are animated in Figure 2.

| Rung | Configurations Remaining | Epochs per Configuration |
| --- | --- | --- |
| 1 | 27 | 1 |
| 2 | 9 | 3 |
| 3 | 3 | 9 |
| 4 | 1 | 27 |

**Table 1: **SHA with η=3 starting with 27 configurations, each allocated a resource of 1 epoch in the first rung.
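The schedule in Table 1 follows a simple geometric pattern: rung k holds n/η^k configurations, each allocated min_resource · η^k units of budget. A small helper (hypothetical, for illustration) makes this concrete:

```python
def sha_schedule(n_configs, min_resource, eta):
    """Return the (configurations, resource-per-configuration) pair for
    each rung of successive halving: rung k holds n/eta^k configurations,
    each allocated min_resource * eta^k units of budget."""
    rungs, n, r = [], n_configs, min_resource
    while n >= 1:
        rungs.append((n, r))
        if n == 1:
            break
        n, r = n // eta, r * eta
    return rungs
```

For example, `sha_schedule(27, 1, 3)` reproduces Table 1: `[(27, 1), (9, 3), (3, 9), (1, 27)]`.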

In the sequential setting, successive halving evaluates orders of magnitude more hyperparameter configurations than random search by adaptively allocating resources to promising configurations. Unfortunately, it is difficult to parallelize because the algorithm takes a set of configurations as input and waits for all configurations in a rung to complete before promoting configurations to the next rung.

To remove the bottleneck created by synchronous promotions, we tweak the successive halving algorithm to grow from the bottom up and promote configurations whenever possible instead of starting with a wide set of configurations and narrowing down. We call this the Asynchronous Successive Halving Algorithm (ASHA).

ASHA begins by assigning workers to add configurations to the bottom rung. When a worker finishes a job and requests a new one, we look at the rungs from top to bottom to see if there are configurations in the top 1/η of any rung that can be promoted to the next rung. If not, we assign the worker a new configuration in the bottom rung, growing its width so that more configurations can eventually be promoted.
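The job-assignment logic above can be sketched as follows. This is a simplified illustration rather than our actual implementation: each rung is assumed to be a dict holding completed results (`"done"`: a list of (loss, config) pairs) and the set of configurations already promoted out of it (`"promoted"`), and `sample_config` is a hypothetical function that draws a fresh configuration.

```python
def asha_get_job(rungs, eta, sample_config):
    """When a worker requests a job, scan rungs from top to bottom; if
    some rung has a configuration in its top 1/eta that has not yet been
    promoted, promote it to the rung above. Otherwise, grow the bottom
    rung with a freshly sampled configuration."""
    for k in reversed(range(len(rungs) - 1)):
        done = sorted(rungs[k]["done"])          # lower loss is better
        top = done[: len(done) // eta]           # top 1/eta finishers
        candidates = [c for _, c in top if c not in rungs[k]["promoted"]]
        if candidates:
            config = candidates[0]
            rungs[k]["promoted"].add(config)
            return config, k + 1                 # run config in rung k+1
    return sample_config(), 0                    # grow the bottom rung
```

Because a configuration is promoted as soon as it lands in the top 1/η of its rung, no worker ever blocks waiting for an entire rung to finish.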

Figures 2 and 3 animate the promotion schemes for synchronous and asynchronous successive halving when using 10 workers, along with the associated worker efficiency for each. As shown in Figure 2, the naive way of parallelizing SHA, where each configuration in a rung is distributed across workers, diminishes in efficiency as the number of jobs dwindles in higher rungs. In contrast, asynchronous SHA approaches near 100% resource efficiency, as workers are always able to stay busy by expanding the base rung when no configurations can be promoted to higher rungs.

**Figure 2: **Successive halving with synchronous promotions.

**Figure 3: **Successive halving with asynchronous promotions.

In our first set of experiments, we compare ASHA to SHA and PBT on two benchmark tasks on CIFAR-10: (1) tuning a convolutional neural network (CNN) with the cuda-convnet architecture and the same search space as Li et al. (2017); and (2) tuning a CNN architecture with varying number of layers, batch size, and number of filters. PBT (population based training) is a state-of-the-art evolutionary method that iteratively improves the fitness of a population of configurations after partially training the current population. For a more extensive comparison of ASHA to additional state-of-the-art hyperparameter optimization methods, please take a look at our full paper.

The resources allocated to the rungs as a fraction of the maximum resource per model R by SHA and ASHA are shown in Table 2; the number of remaining configurations in synchronous SHA is shown as well. Note that we use an elimination rate of η=4 so that only the top ¼ of configurations are promoted to the next rung.

| Rung | Configurations Remaining | Resource per Configuration |
| --- | --- | --- |
| 1 | 256 | R/256 |
| 2 | 64 | R/64 |
| 3 | 16 | R/16 |
| 4 | 4 | R/4 |
| 5 | 1 | R |

**Table 2:** SHA with η=4 starting with 256 configurations, each allocated a resource of R/256 in the first rung. ASHA simply allocates the indicated resource to configurations in each rung and promotes configurations in the top 1/4 to the rung above.

We compare the methods in both single machine and distributed settings. Figure 4 shows the performance of each search method on a single machine. Our results show that SHA and ASHA outperform PBT on the first benchmark, while all three methods perform comparably on the second. Note that ASHA achieves comparable performance to SHA on both benchmarks despite promoting configurations asynchronously.

**Figure 4:** Comparison of hyperparameter optimization methods on 2 benchmark tasks using a single machine. Average across 10 trials is shown with dashed lines representing top and bottom quintiles.

As shown in Figure 5, the story is similar in the distributed setting with 25 workers. For benchmark 1, ASHA evaluated over 1000 configurations in just over 40 minutes with 25 workers (compared to 25 for random search) and found a good configuration (error rate below 0.21) in approximately the time needed to train a single model, whereas it took ASHA nearly 400 minutes to do so in the sequential setting (Figure 4). Notably, we only achieve a 10× speedup on 25 workers due to the relative simplicity of this task, i.e., it only required evaluating a few hundred configurations before identifying a good one in the sequential setting. In contrast, when considering the more difficult search space in benchmark 2, we observe linear speedups with ASHA, as the roughly 700 minutes in the sequential setting (Figure 4) needed to find a configuration with test error below 0.23 is reduced to under 25 minutes in the distributed setting.

We further note that ASHA outperforms PBT on benchmark 1; in fact, the min/max range for ASHA across 5 trials does not overlap with the average for PBT. On benchmark 2, PBT slightly outperforms asynchronous Hyperband and performs comparably to ASHA. However, the ranges for the searchers overlap substantially, so the difference is likely not significant. ASHA's slight outperformance of PBT on these two tasks, coupled with the fact that it is a more principled and general approach (e.g., agnostic to resource type and robust to hyperparameters that change the size of the model), further motivates its use for distributed hyperparameter optimization.

**Figure 5: **Comparison of hyperparameter optimization methods on 2 benchmarks using 25 machines. Average across 5 trials is shown with dashed lines representing min/max ranges. Black vertical dotted lines represent maximum time needed to train a configuration in the search space. Blue vertical dotted line indicates time given to each searcher in the single machine experiments in Figure 4.

We tune a one layer LSTM language model for next word prediction on the Penn Treebank (PTB) dataset. Each tuner is given 500 workers to tune 9 different hyperparameters that control the optimization routine and the model architecture. We evaluate the performance of ASHA and compare to the default tuning method in Vizier, Google’s internal hyperparameter optimizer service, with and without early-stopping.

Our results in Figure 6 show that ASHA is 3x faster than Vizier at finding a good configuration; namely, ASHA is able to find a configuration with perplexity below 80 in the time it takes to train an average configuration for R resource (this search space contains hyperparameters that affect the training time like number of hidden units and batch size), compared to the nearly 3R needed by Vizier.

**Figure 6: **Large-scale ASHA benchmark that takes on the order of weeks to run with 500 workers. The x-axis is measured in units of average time to train a single configuration for R resource. The average across 5 trials is shown, with dashed lines indicating min/max ranges.

Notably, we observe that certain hyperparameter configurations in this benchmark induce perplexities that are orders of magnitude larger than the average-case perplexity. Model-based methods that make assumptions on the data distribution, such as Vizier, can degrade in performance unless care is taken to adjust for such outliers. We attempted to alleviate this by capping perplexity scores at 1000, but the outliers still significantly hampered Vizier's performance. We view robustness to these types of scenarios as an additional benefit of ASHA and Hyperband.

For more details about the successive halving algorithm and how we guarantee that the algorithm will not eliminate good configurations prematurely, see our original paper for the sequential setting. Finally, we refer interested readers to this paper for more details about the asynchronous version of the successive halving algorithm, an extended related work section, and more extensive empirical studies.

**DISCLAIMER:** All opinions expressed in this post are those of the author and do not represent the views of CMU.

The blog aims to provide a general-audience medium for the CMU community to share cutting-edge research findings as well as perspectives on the field of machine learning, with easily digestible material that is both accessible and informative to readers with a wide range of expertise. Posts are written by students, postdocs, and faculty throughout all of CMU, with blog content curated by a student-led editorial board. Our five inaugural posts over the next month will be written and edited by the editorial board, and moving forward, posts on a variety of machine learning topics will appear approximately bi-weekly.

Check out the about page for more information and the submissions page for contribution guidelines. Look for our first few posts in the coming weeks; we are looking forward to sharing our work with you!
