Alexander Kolesnikov is a researcher in the Google Brain team. His current research
interests include visual representation learning and data-efficient adaptation
algorithms. Previously, Alexander obtained an MSc degree in applied mathematics and
programming from Moscow State University and a PhD degree in computer science
from IST Austria.
Convolutional Neural Networks (CNNs) have dominated the field of computer vision for almost a decade. In this talk, I will present two recent papers that propose new and highly competitive architecture classes for computer vision. In the first part, I will present the Vision Transformer (ViT) model, which is almost identical to the standard Transformer model used in natural language processing, yet works surprisingly well for vision applications. In the second part, I will present the MLP-Mixer model, an all-MLP architecture for vision. It can be seen as a simplified ViT model without self-attention layers. Nevertheless, it also demonstrates strong results across a wide range of vision applications.
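
To make the "ViT without self-attention" description concrete, here is a minimal NumPy sketch of the two ingredients the abstract mentions: cutting an image into patch tokens, as ViT does, and mixing those tokens with plain MLPs instead of attention. It is an illustration only and is not taken from the talk or from any reference implementation. The function names (patchify, mixer_block, init_params), the layer sizes, and the random weights are all assumptions chosen for the example.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (no learned scale or shift in this sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, w2):
    """Two-layer MLP applied to the last axis of x (biases omitted)."""
    return gelu(x @ w1) @ w2


def mixer_block(tokens, params):
    """One Mixer-style block on a (num_patches, dim) token table."""
    # Token mixing: transpose so the MLP acts across patches. This is the
    # step that stands in for self-attention in a ViT block.
    y = layer_norm(tokens).T  # (dim, num_patches)
    tokens = tokens + mlp(y, params["tok_w1"], params["tok_w2"]).T
    # Channel mixing: an ordinary per-token MLP, as in a Transformer block.
    y = layer_norm(tokens)
    return tokens + mlp(y, params["ch_w1"], params["ch_w2"])


def init_params(rng, num_patches, dim, tok_hidden, ch_hidden):
    """Random weights for illustration; the sizes are hypothetical."""
    scale = 0.02
    return {
        "tok_w1": scale * rng.standard_normal((num_patches, tok_hidden)),
        "tok_w2": scale * rng.standard_normal((tok_hidden, num_patches)),
        "ch_w1": scale * rng.standard_normal((dim, ch_hidden)),
        "ch_w2": scale * rng.standard_normal((ch_hidden, dim)),
    }


def demo():
    rng = np.random.default_rng(0)
    image = rng.standard_normal((32, 32, 3))
    patches = patchify(image, patch=8)            # 16 patches of 8*8*3 values
    embed = 0.02 * rng.standard_normal((patches.shape[1], 64))
    tokens = patches @ embed                      # linear patch embedding
    params = init_params(rng, num_patches=16, dim=64, tok_hidden=32, ch_hidden=128)
    return mixer_block(tokens, params)
```

Calling demo() returns a token table with the same shape as its input, so blocks of this kind can be stacked. In this sketch, replacing the token-mixing step with a self-attention layer would give a block of the ViT kind, which is the sense in which the abstract calls MLP-Mixer a simplified ViT.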