Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

University of Minnesota

Last-meter navigation enables robots to achieve manipulation-ready positioning, bridging the critical gap between global navigation and manipulation.

Last-Meter Navigation

Object-centric imitation learning framework that solves the last-meter navigation problem and produces manipulation-ready base poses with centimeter-level accuracy.

One-Instance Transfer

Demonstration of strong instance-to-category generalization, where a model trained on a single object instance reliably transfers to unseen objects of the same category.

RGB-Only Perception

Real-world validation that precise last-meter navigation is achievable using only onboard RGB observations, without depth, LiDAR, or map priors.

Abstract

Precise positioning of the mobile manipulator's base is essential for the manipulation actions that follow. However, most RGB-based navigation systems guarantee only coarse, meter-level accuracy, making them ill-suited to the precise positioning phase of mobile manipulation.

This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator to achieve manipulation-ready positioning using only RGB observations from its onboard cameras.

Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To evaluate this comprehensively, we introduce two metrics: an edge-alignment metric, which uses the ground-truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 73.47% success in edge-alignment and 96.94% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at the category level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation.
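For concreteness, the sketch below shows one way the two success criteria could be scored. The tolerance values, the pixel-centering test for object-alignment, and the yaw-only edge-alignment check are illustrative assumptions, not the exact definitions used in our evaluation.

```python
import numpy as np

def edge_alignment_success(robot_yaw, gt_goal_yaw, yaw_tol_deg=15.0):
    """Edge-alignment (illustrative): compare the robot's final heading against
    the ground-truth goal orientation; yaw_tol_deg is a hypothetical tolerance,
    not the threshold used in the paper. Angles are in radians."""
    err = np.abs(np.arctan2(np.sin(robot_yaw - gt_goal_yaw),
                            np.cos(robot_yaw - gt_goal_yaw)))
    return np.degrees(err) <= yaw_tol_deg

def object_alignment_success(target_mask, image_width, center_tol=0.15):
    """Object-alignment (illustrative): the segmented target should appear
    roughly centered in the final onboard RGB frame, i.e. the robot faces it."""
    cols = np.nonzero(target_mask)[1]   # column indices of target pixels
    if cols.size == 0:                  # target not visible -> failure
        return False
    offset = abs(cols.mean() / image_width - 0.5)
    return offset <= center_tol
```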

Sequential Last-meter Navigation for Mobile Manipulation

Last-meter navigation enables the robot to navigate effectively between different objects, facilitating sequential multi-stage mobile manipulation tasks. By chaining last-meter navigation policies, the robot can transition from one workspace to another with high precision.
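As a loose illustration of this chaining, the sketch below strings together a coarse global-navigation step, the last-meter policy, and a manipulation skill for each stage. The callables and their signatures are hypothetical placeholders, not our actual interfaces.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical orchestration of a multi-stage mobile manipulation task.
# global_nav and last_meter_nav stand in for the global planner and the
# last-meter policy; each stage supplies a text prompt, a goal image,
# and a manipulation skill to run once the base is positioned.
def run_sequential_task(
    global_nav: Callable[[str], None],
    last_meter_nav: Callable[[str, object], None],
    stages: Iterable[Tuple[str, object, Callable[[], None]]],
) -> None:
    for prompt, goal_image, manipulation_skill in stages:
        global_nav(prompt)                  # coarse, meter-level approach
        last_meter_nav(prompt, goal_image)  # precise, manipulation-ready base pose
        manipulation_skill()                # e.g., grasp or place at this workspace
```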

Robustness to Dynamic Environments

Our method demonstrates robustness to dynamic environments, successfully adapting to target objects that move during operation. The system continuously re-evaluates the target's pose relative to the robot and adjusts the approach path in real time.
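A minimal closed-loop sketch of this behavior is given below; get_observations, send_base_command, and the policy interface are hypothetical stand-ins for the onboard camera stream and the learned policy, and the STOP handling mirrors the auxiliary stopping mechanism only schematically.

```python
# Schematic closed-loop execution: the target is re-grounded from fresh RGB
# observations at every step, so a moving object simply shifts the predicted
# actions on the next iteration. All interfaces here are illustrative.
def last_meter_loop(policy, robot, goal_image, prompt, max_steps=200):
    for _ in range(max_steps):
        obs = robot.get_observations()                 # fresh multi-view RGB frames
        action = policy.step(obs, goal_image, prompt)  # re-evaluates relative pose
        if action == "STOP":                           # auxiliary stopping decision
            return True                                # manipulation-ready pose reached
        robot.send_base_command(action)                # Forward / Lateral / Rotate
    return False                                       # timed out before converging
```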

Last-meter Navigation

Last-meter Navigation Concept Overview

Last-meter navigation is the stage between global path planning and manipulation in which the robot must achieve centimeter-level positional accuracy and degree-level orientation accuracy relative to a target. Whereas global navigation often defines success as stopping within roughly one meter of the goal, manipulation policies operate reliably only under much tighter alignment, and this mismatch causes many mobile manipulation failures. Last-meter navigation addresses this gap by explicitly focusing on the final meter of motion so that the robot arrives in a manipulation-ready pose.

In the example on the left: (1) global navigation first drives the robot near the target; (2) once the target object (e.g., the orange chair) is detected, our policy is invoked; and (3) last-meter navigation adjusts the robot’s base to a precise manipulation-ready pose defined by a goal observation.

Methodology

System Architecture

Architecture Overview. At each timestep, the model receives current and goal observations. A segmentation module (driven by a language prompt) generates object masks. The action decoder uses a spatial score-matrix to predict discrete actions (Forward, Lateral, Rotate).
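The toy PyTorch module below sketches this per-timestep computation. The feature extractor, patch counts, layer sizes, and three-way discretization of each action head are assumptions made for illustration and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class LastMeterPolicy(nn.Module):
    """Illustrative per-timestep decoder: patch features of the segmented target
    in the current and goal observations are correlated into a spatial
    score-matrix, which is decoded into discrete base actions."""

    def __init__(self, feat_dim=384, num_patches=256, hidden=256):
        super().__init__()
        self.score_proj = nn.Linear(feat_dim, feat_dim, bias=False)
        self.decoder = nn.Sequential(
            nn.Linear(num_patches * num_patches, hidden),
            nn.ReLU(),
        )
        # One discrete head per base-motion primitive (hypothetical 3-way bins).
        self.forward_head = nn.Linear(hidden, 3)  # e.g., {backward, stay, forward}
        self.lateral_head = nn.Linear(hidden, 3)  # e.g., {left, stay, right}
        self.rotate_head = nn.Linear(hidden, 3)   # e.g., {ccw, stay, cw}

    def forward(self, cur_feats, goal_feats):
        # cur_feats, goal_feats: (B, num_patches, feat_dim) patch features of the
        # masked target object in the current and goal views.
        scores = torch.einsum("bnd,bmd->bnm", self.score_proj(cur_feats), goal_feats)
        h = self.decoder(scores.flatten(1))       # spatial score-matrix -> embedding
        return self.forward_head(h), self.lateral_head(h), self.rotate_head(h)
```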

Quantitative Results

Quantitative results (Part 1 and Part 2).

Our system (DinoScoreAux) achieves the highest success rate of all compared baselines, demonstrating the importance of the spatial score-matrix decoder and the auxiliary stopping mechanism.