This document proposes an approach to Indian sign language recognition from hand gestures that combines a vision transformer with a convolutional neural network (CNN). It discusses the challenges that existing data-glove and vision-based techniques pose for hand gesture recognition and human-computer interaction, and aims to develop a more natural and accessible method using computer vision and deep learning. The proposed method is intended to improve on traditional machine learning and CNN techniques: it achieves a reported 99.88% accuracy on a test image database, outperforming the state-of-the-art methods it is compared against. An ablation study further indicates that convolutional encoding increases accuracy for hand gesture recognition.
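The summary does not describe the architecture in detail, so the following is only a minimal sketch of the general idea of pairing convolutional encoding with a transformer-style attention step. It assumes that "convolutional encoding" means turning image windows into tokens with a small bank of convolution filters; the function names (`conv_tokens`, `self_attention`, `classify`), the random weights, the 8x8 input, and the five output classes are all hypothetical and are not taken from the document.

```python
import math
import random


def conv_tokens(image, filters, stride):
    """Slide a bank of k x k filters over a 2-D image; each window becomes one token."""
    k = len(filters[0])  # every filter is assumed to be k x k
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - k + 1, stride):
        for left in range(0, w - k + 1, stride):
            # One output channel per filter: dot product of the filter and the window.
            token = [
                sum(f[i][j] * image[top + i][left + j]
                    for i in range(k) for j in range(k))
                for f in filters
            ]
            tokens.append(token)
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption for brevity: queries, keys, and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) for c in range(d)])
    return out


def classify(image, filters, class_weights, stride=2):
    """Convolutional tokens -> attention -> mean pooling -> linear layer -> softmax."""
    tokens = self_attention(conv_tokens(image, filters, stride))
    d = len(tokens[0])
    pooled = [sum(t[c] for t in tokens) / len(tokens) for c in range(d)]
    logits = [sum(wt * x for wt, x in zip(row, pooled)) for row in class_weights]
    return softmax(logits)


# Hypothetical toy input: an 8x8 grayscale "image", three 4x4 filters, five classes.
random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]
filters = [[[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)] for _ in range(3)]
class_weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

probs = classify(image, filters, class_weights)
print(len(conv_tokens(image, filters, 2)), "tokens;", len(probs), "class probabilities")
```

With untrained random weights the output probabilities carry no meaning; the sketch only shows how overlapping convolutional windows can stand in for the flat patch embedding of a plain vision transformer, which is the kind of component an ablation study would switch on and off.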