Vidi is a family of large multimodal models for deep video understanding and editing. It integrates vision, audio, and language to support sophisticated querying and manipulation of video content, processing long-form, real-world videos and answering questions such as "when in this clip does X happen?" or "where in the frame is object Y during that moment?". Its core capabilities are temporal retrieval, spatio-temporal grounding (locating objects in both time and space), and video question answering. Vidi targets applications such as intelligent video editing, automated video search, content analysis, and editing assistance, letting users efficiently locate relevant segments and objects in hours-long footage. The system is built with open-source release in mind: model code, inference scripts, and evaluation pipelines are available so developers can reproduce the research results or integrate Vidi into their own video-processing workflows.
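
As a rough illustration of the kind of query the system answers, here is a minimal sketch of a temporal-retrieval exchange. The helper names, fields, and example values below are assumptions made for exposition, not Vidi's released API.

```python
# Hypothetical sketch of a temporal-retrieval exchange. The helper names
# and fields below are illustrative assumptions, not Vidi's released API.
from dataclasses import dataclass

@dataclass
class TimeRange:
    start_s: float  # matched segment start, in seconds
    end_s: float    # matched segment end, in seconds

def answer_when(video_path: str, query: str) -> list[TimeRange]:
    """Stand-in for a Vidi-style call that maps a natural-language query
    (e.g. "when does the speaker mention the budget?") onto the matching
    time ranges in a long video."""
    # A real implementation would invoke the released inference scripts;
    # a fixed result is returned here so the sketch runs end to end.
    return [TimeRange(start_s=754.0, end_s=761.5)]

for r in answer_when("meeting.mp4", "when does the speaker mention the budget?"):
    print(f"match: {r.start_s:.1f}s - {r.end_s:.1f}s")
```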
## Features
- Multimodal video understanding: jointly processes video, audio, and text to answer complex queries
- Temporal retrieval: identifies time ranges in long videos corresponding to given text queries
- Spatio-temporal grounding: localizes target objects with bounding boxes across the relevant time range (see the sketch after this list)
- Video question answering: supports QA over video content rather than only retrieval or segmentation
- Open-source release with model code, inference scripts, and evaluation pipelines for reproducible research and easy integration
- Long-context support: handles extended, hours-long footage rather than only short clips
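
To make the grounding output above concrete, the sketch below shows one way a spatio-temporal grounding result could be structured as a time range plus a track of per-timestamp bounding boxes. The field names and the (x1, y1, x2, y2) pixel convention are assumptions for exposition, not Vidi's actual output schema.

```python
# Illustrative data structures for a spatio-temporal grounding result.
# Field names and the (x1, y1, x2, y2) pixel convention are assumptions
# for exposition, not Vidi's actual output schema.
from dataclasses import dataclass

@dataclass
class GroundedBox:
    t_s: float                               # timestamp in seconds
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

@dataclass
class GroundingResult:
    query: str                  # object description, e.g. "the red car"
    start_s: float              # first timestamp where the object is grounded
    end_s: float                # last timestamp where the object is grounded
    track: list[GroundedBox]    # one box per sampled timestamp

def summarize(result: GroundingResult) -> str:
    """Report when and over how many sampled frames the object was located."""
    return (f"'{result.query}' grounded from {result.start_s:.1f}s to "
            f"{result.end_s:.1f}s across {len(result.track)} sampled frames")

demo = GroundingResult(
    query="the red car",
    start_s=12.0,
    end_s=15.5,
    track=[
        GroundedBox(12.0, (104.0, 220.0, 310.0, 388.0)),
        GroundedBox(13.0, (118.0, 224.0, 322.0, 392.0)),
    ],
)
print(summarize(demo))
```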