This document summarizes a lecture on transformer models for vision tasks, covering their background, architecture, and application to image classification. It traces the shift from convolutional neural networks to transformers, the challenges of applying self-attention to visual data (notably its quadratic cost in the number of tokens), and innovations such as local and axial self-attention. It also discusses scaling laws and how vision transformers perform compared with convolutional baselines such as ResNet.
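The self-attention challenge mentioned above can be made concrete with a minimal sketch: attention forms a score matrix over all token pairs, so treating every pixel as a token is far more expensive than using patches. The 224x224 resolution and 16x16 patch size below are standard ViT settings assumed for illustration, not figures from the source.

```python
def attention_matrix_size(seq_len: int) -> int:
    # Self-attention computes a (seq_len x seq_len) score matrix,
    # so memory and compute grow quadratically with sequence length.
    return seq_len * seq_len

# Treating every pixel of a 224x224 image as a token:
pixels = 224 * 224                   # 50176 tokens
# Splitting the same image into 16x16 patches (the ViT approach):
patches = (224 // 16) * (224 // 16)  # 196 tokens

print(attention_matrix_size(pixels))   # 2517630976 entries
print(attention_matrix_size(patches))  # 38416 entries
```

This roughly 65,000x gap in attention-matrix size is why per-pixel self-attention is impractical and why patching, local attention, and axial attention are needed.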