Are Sixteen Heads Really Better than One?

Since their inception in the 2017 paper "Attention Is All You Need" by Vaswani et al., transformer models have become a staple of NLP research. They are used in machine translation, language modeling, and, more generally, in most recent state-of-the-art pretrained models (Devlin et al., 2018; Radford et al., 2018; Yang et al., 2019; Liu et al., 2019, among many, many others). A central innovation in the transformer is the ubiquitous use of a multi-headed attention mechanism. In this blog post, we'll take a closer look at this mechanism and try to understand just how important multiple heads actually are.
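Before digging in, here is a minimal sketch of scaled dot-product multi-head self-attention, just to fix the idea of what a "head" is: the input is projected into queries, keys, and values, split into several heads that each attend over the sequence independently, and the heads' outputs are concatenated and mixed by an output projection. The function and parameter names below are hypothetical, and the sketch leaves out masking, dropout, and cross-attention.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Minimal sketch of multi-head self-attention (names are illustrative).

    x:              (batch, seq_len, d_model) input activations
    w_q, w_k, w_v:  (d_model, d_model) query/key/value projections
    w_o:            (d_model, d_model) output projection
    """
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split the model dimension into n_heads heads of size d_head.
    def split_heads(t):
        return t.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

    q = split_heads(x @ w_q)   # (batch, n_heads, seq_len, d_head)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own scaled dot-product attention over the sequence.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    attn = F.softmax(scores, dim=-1)
    heads = attn @ v           # (batch, n_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
    return concat @ w_o

# Toy usage: 16 heads over a model dimension of 512, as in the base transformer's
# larger configurations (purely illustrative shapes).
d_model, n_heads = 512, 16
x = torch.randn(2, 10, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) * d_model ** -0.5 for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # torch.Size([2, 10, 512])
```

The key design point for what follows: because each head has its own attention distribution, different heads can, in principle, specialize in different patterns, and the question is whether all of them are actually needed.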