Convolutional Neural Networks (CNNs) have dominated the field of computer vision for almost a decade. In this talk I will present two recent papers that propose new, highly competitive architecture classes for computer vision. In the first part I will present the Vision Transformer model (ViT), which is almost identical to the standard Transformer model used in natural language processing, yet turns out to work surprisingly well for vision applications. In the second part of the talk, I will present the MLP-Mixer model: an all-MLP architecture for vision. It can be seen as a simplified ViT model without the self-attention layers. Nevertheless, it also demonstrates strong results across a wide range of vision applications.
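
To make the relationship between the two architectures concrete, below is a minimal, illustrative sketch (not the papers' code) of a single MLP-Mixer block: where a ViT block applies self-attention across the patch tokens, the Mixer block applies an ordinary MLP along the token (patch) dimension, followed by a second MLP along the channel dimension. All names and sizes here (MixerBlock, num_patches, channels, the hidden widths) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, applied to the last dimension of its input."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class MixerBlock(nn.Module):
    """One Mixer block: token-mixing MLP (in place of self-attention) + channel-mixing MLP."""
    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_patches, tokens_hidden)    # mixes information across patches
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channels_hidden)   # mixes information across channels

    def forward(self, x):                            # x: (batch, num_patches, channels)
        y = self.norm1(x).transpose(1, 2)            # (batch, channels, num_patches)
        x = x + self.token_mlp(y).transpose(1, 2)    # token mixing, replaces ViT's attention
        x = x + self.channel_mlp(self.norm2(x))      # channel mixing, like ViT's per-token MLP
        return x

# Example (hypothetical sizes): 196 patches with 512 channels each.
block = MixerBlock(num_patches=196, channels=512, tokens_hidden=256, channels_hidden=2048)
out = block(torch.randn(2, 196, 512))                # -> shape (2, 196, 512)
```

Replacing the self-attention layer with this fixed token-mixing MLP is what makes the Mixer an "all-MLP" architecture; the rest of the block (layer norms, residual connections, channel MLP) follows the familiar Transformer layout.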