
MSR Thesis Defense

Nader Zantout
MSR Student, Robotics Institute, Carnegie Mellon University
Friday, July 11
11:00 am to 12:30 pm
Newell-Simon Hall 4305
Towards Practical Vision-and-Language Navigation Systems Through 3D Referential Grounding

Abstract:

As robots transition toward practical deployment as collaborative agents in human environments, it becomes essential to improve language-conditioned environmental understanding. A vision-and-language navigation (VLN) system must adapt to both the types of language used and the actions expected by a human collaborator. Often, a single sentence containing spatial relations and semantic attributes—e.g., “fetch the yellow bottle on the table”—is all that is provided to specify a target object in a complex scene. The task of identifying the correct object from such a statement is known as 3D referential grounding.
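
To make the task concrete, the sketch below shows one naive way to pose 3D referential grounding in code: each candidate object carries a category, attributes, and a position, and a simple lexical-overlap heuristic picks the object whose description best matches the utterance. The data structures and the heuristic are illustrative assumptions only and do not reflect the models developed in this thesis.

    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        object_id: int
        category: str                                            # e.g. "bottle"
        attributes: list[str] = field(default_factory=list)      # e.g. ["yellow"]
        centroid: tuple[float, float, float] = (0.0, 0.0, 0.0)   # position in meters

    def ground(objects: list[SceneObject], utterance: str) -> SceneObject:
        # Naive lexical-overlap heuristic: score each object by how many of its
        # category/attribute terms appear in the utterance; return the best match.
        words = set(utterance.lower().split())
        def score(obj: SceneObject) -> int:
            terms = {obj.category.lower(), *(a.lower() for a in obj.attributes)}
            return len(terms & words)
        return max(objects, key=score)

    scene = [
        SceneObject(0, "bottle", ["yellow"], (1.2, 0.4, 0.8)),
        SceneObject(1, "bottle", ["blue"], (2.0, 0.1, 0.8)),
        SceneObject(2, "table", [], (1.0, 0.5, 0.0)),
    ]
    print(ground(scene, "fetch the yellow bottle on the table").object_id)  # -> 0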

This thesis develops and deploys a practical VLN system through the lens of 3D referential grounding, a particularly challenging task due to the large number of objects in typical scenes and the relative scarcity of 3D data compared to 2D. We pursue two complementary approaches: (1) scaling up the training of an end-to-end 3D referential grounding model, and (2) decomposing the task into a modular pipeline.

First, we introduce IRef-VLA, a large-scale benchmark for Interactive Referential Vision-and-Language-guided Action. IRef-VLA aims to improve generalization in 3D referential grounding using synthetic utterances generated from scene graphs with view-independent spatial relations. Baseline models trained on IRef-VLA show strong zero-shot transfer performance, and an LLM-based graph search baseline achieves high grounding accuracy, motivating a modular alternative to end-to-end approaches.
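
As a rough illustration of the synthetic-utterance idea (the representation below is a simplified assumption, not the actual IRef-VLA format), a referring expression can be templated directly from a scene-graph edge whose relation, such as "on" or "near", holds regardless of the viewer's position:

    scene_graph = {
        "nodes": {
            0: {"category": "bottle", "attributes": ["yellow"]},
            1: {"category": "table", "attributes": []},
        },
        # View-independent relations ("on", "near", "above") hold no matter
        # where the observer stands, unlike "left of" or "in front of".
        "edges": [(0, "on", 1)],
    }

    def utterance_from_edge(graph, edge):
        target_id, relation, anchor_id = edge
        describe = lambda n: " ".join(n["attributes"] + [n["category"]])
        target, anchor = graph["nodes"][target_id], graph["nodes"][anchor_id]
        return f"the {describe(target)} {relation} the {describe(anchor)}"

    print(utterance_from_edge(scene_graph, scene_graph["edges"][0]))
    # -> "the yellow bottle on the table"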

We then explore this modular approach in SORT3D, a Spatial Object-centric Reasoning Toolbox for 3D grounding with foundation models. SORT3D combines real-time semantic mapping, vision-language captioning, query-based object filtering, and structured spatial reasoning via LLMs into a deployable system. It demonstrates strong zero-shot performance across two benchmark datasets and on real-world robotic platforms operating in unseen environments.
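
The decomposition can be pictured as a thin pipeline that chains the stages named above; the function names and interfaces in this skeleton are hypothetical placeholders, not the SORT3D API:

    def semantic_mapping(sensor_stream):
        """Fuse detections from the sensor stream into 3D object instances (a map)."""
        ...

    def caption_objects(objects, images):
        """Attach vision-language captions and attributes to each mapped object."""
        ...

    def filter_candidates(objects, query):
        """Keep only objects whose category or caption is relevant to the query."""
        ...

    def spatial_reasoning_llm(candidates, query):
        """Prompt an LLM to resolve the query's spatial relations and pick the target."""
        ...

    def ground_query(sensor_stream, images, query):
        # Each stage is swappable, which is what makes the system modular and
        # lets it be deployed zero-shot in unseen environments.
        objects = semantic_mapping(sensor_stream)
        objects = caption_objects(objects, images)
        candidates = filter_candidates(objects, query)
        return spatial_reasoning_llm(candidates, query)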

Together, these systems establish a template for building effective collaborative embodied agents, in which the ideal model strikes a middle ground between fully end-to-end learning and a purely heuristics-based approach, and they serve as a springboard toward general-purpose VLN systems deployable in any environment.

Committee:
Ji Zhang (advisor)
Wenshan Wang (co-advisor)
Jean Oh
Ayush Jain