This keynote talk discusses how computer vision is shifting from traditional convolutional neural networks (CNNs) to vision transformers (ViTs). ViTs split images into patches that are fed into a transformer encoder, much as text is handled with word embeddings. This approach performs competitively with CNNs while being conceptually simpler. The talk outlines the ViT architecture and how it works, notes that ViTs dispense with convolutions, and examines the significance of ViT variants. It encourages attendees to start exploring ViTs through an online tutorial and to contact the speaker for additional help.
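To make the patch-to-token idea concrete, here is a minimal sketch, not the speaker's implementation, of how an image can be cut into fixed-size patches, projected to embeddings, and passed through a standard transformer encoder. It assumes PyTorch and common ViT-Base hyperparameters (224x224 images, 16x16 patches, 768-dimensional embeddings), all of which are illustrative choices rather than details from the talk.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each patch to an
    embedding vector, mirroring how word embeddings feed a text transformer."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts and projects patches in one step; it is
        # equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x

# Feed the patch tokens into a standard transformer encoder. A full ViT also
# prepends a learnable class token and adds position embeddings, omitted here.
embed = PatchEmbed()
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

images = torch.randn(1, 3, 224, 224)   # dummy batch of one RGB image
tokens = embed(images)                 # (1, 196, 768) sequence of patch tokens
features = encoder(tokens)             # contextualized patch features
print(tokens.shape, features.shape)
```

The patch projection here uses a convolution purely as a bookkeeping trick; the transformer layers themselves contain no convolutions, which is the sense in which ViTs dispense with them.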