
Approximation of the standard attention module of quadratic complexity in Transformers by the linear attention module in Performers, obtained by decoupling the attention matrix in a low-rank decomposition.
The recent paper “Rethinking Attention with Performers” introduced the Performer, a new model that approximates Transformer architectures and significantly improves their space and time complexity. A new blog post by our Sepp Hochreiter and his team, “Looking at the Performer from a Hopfield point of view”, explains the model in detail and discusses the connection between Performers and classical Hopfield networks.
Transformers are powerful neural network architectures that have achieved impressive results in several areas of machine learning, including natural language processing (NLP), conversational AI, image and music generation, and bioinformatics. Transformers rely on a trainable attention module that identifies complex dependencies between the elements of each input sequence. This attention module scales quadratically with the length of the input sequence, which makes Transformers computationally expensive for long inputs. Performers overcome this problem by constructing attention mechanisms that scale linearly – a major breakthrough in improving Transformer models.
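To make the quadratic cost concrete, here is a minimal NumPy sketch of standard single-head, unmasked softmax attention; the variable names and toy dimensions are illustrative and not taken from the paper. The full L x L score matrix is what makes memory and compute grow quadratically with the sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard Transformer attention: builds the full L x L attention matrix,
    so time and memory grow quadratically with the sequence length L."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                           # (L, L) -- the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                      # (L, d_v)

# Toy example: a sequence of length 512 with 64-dimensional queries, keys, values.
L, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
print(softmax_attention(Q, K, V).shape)  # (512, 64)
```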
Performers provide an accurate and unbiased estimate of the softmax-based attention in Transformers. The linear attention module in Performers is implemented using the Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm. This method is of broad interest beyond Transformers as a more scalable replacement for regular softmax attention. Just as Transformer attention parallels the update rule of continuous modern Hopfield networks, the linear attention module of Performers resembles the update rule of classical Hopfield networks.
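The sketch below illustrates the idea behind FAVOR+ under simplifying assumptions: it uses plain Gaussian rather than orthogonal random features, omits the stabilisation tricks and the causal (unidirectional) variant, and all function and parameter names are ours. The key point is that reordering the matrix products avoids ever forming the L x L attention matrix, so the cost becomes linear in the sequence length.

```python
import numpy as np

def positive_random_features(X, W):
    """Positive random features for the softmax kernel:
    phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), with rows of W drawn from N(0, I_d).
    Then exp(q . k) is approximated, unbiasedly, by phi(q) . phi(k)."""
    m = W.shape[0]
    proj = X @ W.T                                          # (L, m)
    sq_norm = 0.5 * np.sum(X**2, axis=-1, keepdims=True)    # (L, 1)
    return np.exp(proj - sq_norm) / np.sqrt(m)              # (L, m), all entries positive

def linear_attention(Q, K, V, n_features=256, seed=0):
    """FAVOR+-style linear attention sketch (non-causal, non-orthogonal features)."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_features, d))
    # Fold the usual 1/sqrt(d) temperature into the queries and keys.
    Qp = positive_random_features(Q / d**0.25, W)           # (L, m)
    Kp = positive_random_features(K / d**0.25, W)           # (L, m)
    # Reordered products: O(L * m * d) instead of O(L^2 * d).
    KV = Kp.T @ V                                           # (m, d_v)
    normaliser = Qp @ Kp.sum(axis=0)                        # (L,) approximate softmax row sums
    return (Qp @ KV) / normaliser[:, None]

# The approximation can be compared against the exact softmax attention above.
L, d = 512, 64
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, L, d))
print(linear_attention(Q, K, V).shape)  # (512, 64)
```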