Prompt Then Ground:
Task-Aware Scene Understanding
via Online Neural Implicit Mapping

[1]Institute for AI Industry Research (AIR), Tsinghua University, [2]Georgia Institute of Technology,
[3]Technical University of Munich

Video presentation for ProGround.

Abstract

Open-vocabulary scene understanding is crucial for robotic applications: it involves locating targets in 3D semantic scene representations given queries. However, existing mapping approaches often focus on task-agnostic representations and suffer from inaccurate semantic supervision due to noisy and ambiguous perception. We introduce ProGround, a prompt-then-ground framework for online neural implicit mapping that reshapes the data distribution to prioritize task-relevant and high-confidence semantic features. We exploit in-network aggregation of local-global feature pyramids to capture sufficient context information. To ensure fast optimization with accurate reasoning, we probe semantics from the concatenated features of positional and color embeddings and employ a selective experience replay mechanism for continual learning that avoids forgetting. Evaluated on Habitat-Sem, ProGround achieves a +4.36% improvement in Top-1 semantic accuracy over state-of-the-art methods while maintaining memory efficiency. Applications to robotic navigation show the potential of the proposed paradigm.

Method Pipeline

[Figure: method pipeline]

The feature extraction process generates pixel-wise semantic features along with scores from visual observations. Afterwards, the neural field is optimized with sparse samples drawn from the current observation and from replayed experiences.
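The batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_training_batch`, the `(sample_id, score)` layout, and the score-proportional replay weighting are all assumptions chosen to show how scores could bias replay toward high-confidence samples.

```python
import random


def sample_training_batch(current, replay, n_current, n_replay, rng=None):
    """Mix sparse samples from the current frame with replayed samples.

    `current` and `replay` are lists of (sample_id, score) pairs.
    Both the data layout and the weighting rule are hypothetical.
    """
    rng = rng or random.Random()
    # Sparse uniform subset of the current observation (no repeats).
    fresh = rng.sample(current, min(n_current, len(current)))
    replayed = []
    if replay and n_replay > 0:
        # Negative scores are clamped to zero so they are never replayed.
        weights = [max(score, 0.0) for _, score in replay]
        if sum(weights) > 0:
            # Higher-scoring past samples are drawn more often (with repeats).
            replayed = rng.choices(replay, weights=weights, k=n_replay)
    return fresh + replayed
```

Under this assumption, a replayed sample with score zero is never drawn, so low-confidence supervision from earlier frames is filtered out of later optimization steps while well-supported regions keep being revisited.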

Results Gallery

Text Query Visualization

[Figure: text query visualization]

Examples of successful query localization. Each visualized result highlights the model's ability to accurately localize both common and fine-grained object categories.
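A common way to localize a text query in an open-vocabulary feature field is to compare the query embedding against per-point features by cosine similarity. The sketch below assumes that the query text has already been embedded into the same space as the mapped features; `localize_query`, `point_features`, and the `threshold` value are hypothetical names and settings, and the page does not state that ProGround uses this exact rule.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def localize_query(query_embedding, point_features, threshold=0.5):
    """Return indices of points matching the query, most similar first."""
    scored = [(cosine(query_embedding, f), i) for i, f in enumerate(point_features)]
    hits = [(s, i) for s, i in scored if s >= threshold]
    # Sort by descending similarity, breaking ties by point index.
    hits.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in hits]
```

The returned indices could then be highlighted in the reconstructed scene to produce a visualization like the ones shown here.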

Uncertainty Visualization

[Figure: uncertainty visualization]

Semantic uncertainty estimation across scenes. For each pair, left: reconstructed RGB scene, right: predicted semantic uncertainty. High uncertainty (red) often aligns with unsupervised regions. These maps help identify where supervision is lacking and improve downstream query reliability.
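The page does not say how the semantic uncertainty is computed. One standard proxy, shown here purely as an assumption, is the normalized Shannon entropy of the predicted class probabilities at a point: 0 for a one-hot prediction and 1 for a uniform one. The function name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a probability vector, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    # Zero-probability classes contribute nothing to the entropy.
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)
```

With such a measure, regions that received little or conflicting supervision would tend toward near-uniform predictions and therefore show up as high-uncertainty areas.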

Realtime Query Panel

We provide a user-friendly query panel for real-time interaction.

ProGround Maps for Robotic Navigation

We further showcase ProGround's applicability in robotic navigation. Given a pre-constructed task-oriented feature field, we first query the target instances from the field. To ensure efficient path planning, we extract a Voronoi graph from the dense field. The robot then navigates toward the most confident regions, following a path planned on the Voronoi graph.
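The final planning step can be sketched as a shortest-path search over an already extracted graph. The example assumes the Voronoi graph is available as an adjacency dictionary and that each candidate node carries a query confidence; the function `plan_to_most_confident`, the data layout, and the use of Dijkstra's algorithm are illustrative assumptions rather than details given on this page.

```python
import heapq


def plan_to_most_confident(graph, start, confidence):
    """Pick the most confident reachable node and return (goal, path).

    graph: {node: [(neighbor, edge_cost), ...]} with non-negative costs.
    confidence: {node: score} for candidate target nodes.
    Node ids must be hashable and orderable (e.g. strings or ints).
    """
    dist = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    # Only targets that the robot can actually reach are considered.
    reachable = [n for n in confidence if n in dist]
    if not reachable:
        return None, []
    goal = max(reachable, key=lambda n: confidence[n])
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return goal, path[::-1]
```

Planning on a sparse graph rather than on the dense field keeps the search small, which matches the stated motivation of efficient path planning.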