Are Sixteen Heads Really Better than One?
Since their inception in this 2017 paper by Vaswani et al., transformer models have become a staple of NLP research. They are used in machine translation and language modeling, and they underpin most recent state-of-the-art pretrained models (Devlin et al. (2018), Radford et al. (2018), Yang et al. (2019), Liu et al. (2019), among many, many others). A central innovation in the transformer is the ubiquitous use of a multi-headed attention mechanism. In this blog post, we'll take a closer look and try to understand just how important multiple heads actually are.
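To make "multi-headed" concrete, here is a minimal PyTorch-style sketch of the mechanism: each head runs its own scaled dot-product attention over a slice of the model dimension, and the heads' outputs are concatenated and mixed by an output projection. The class and parameter names (`MultiHeadAttention`, `d_model`, `n_heads`) are illustrative rather than taken from any particular implementation.

```python
# A minimal sketch of multi-headed attention (assumed names, not a specific codebase).
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection each for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        # Project, then split the model dimension into n_heads independent heads.
        def split(t):
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed per head in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        attn = scores.softmax(dim=-1)
        context = attn @ v  # (batch, n_heads, seq_len, d_head)

        # Concatenate the heads back together and mix them with the output projection.
        context = context.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(context)


if __name__ == "__main__":
    x = torch.randn(2, 5, 512)                      # (batch, seq_len, d_model)
    mha = MultiHeadAttention(d_model=512, n_heads=16)
    print(mha(x).shape)                             # torch.Size([2, 5, 512])
```

The question the paper asks is essentially about the middle of this sketch: if several of those per-head attention computations were removed, would the model's output change much?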