The document describes a study that uses a U-Net architecture with convolutional LSTMs to generate depth maps from image sequences, motivated by hardware advances that make more complex video analysis tasks feasible. The authors surveyed the existing literature on depth maps, explored several architectures, and trained and tested their models on the KITTI dataset. Initial results are promising; the document also discusses model variations and outlines future work.