This document presents a study on optimizing deep learning inference on resource-constrained embedded devices using hardware looping and loop-unrolling techniques, with a focus on convolutional neural networks (CNNs). It evaluates the performance gains these techniques achieve for a LeNet-5 model implemented on a Zynq-7000 FPGA, demonstrating significant reductions in cycle count. The findings suggest that hardware loops and dot-product instructions are effective for accelerating deep learning kernels in resource-limited environments, setting the stage for future work on neural networks for embedded systems.