SENT-Map: Semantically Enhanced Topological Maps with Foundation Models

University of Minnesota
ICRA 2025: Workshop on Foundation Models and Neuro-Symbolic AI for Robotics

The SENT-Map framework uses a foundation model (FM) to create a semantically enhanced topological map of the environment that is human-readable and editable. Using SENT-Map, an LLM can produce accurate high-level motion plans for robots.

Abstract

We introduce SENT-Map, a semantically enhanced topological map for representing indoor environments, designed to support autonomous navigation and manipulation by leveraging advancements in foundation models (FMs). By representing the environment in a JSON text format, we enable semantic information to be added and edited in a format that both humans and FMs understand, while grounding the robot to existing nodes during planning to avoid infeasible states during deployment. Our proposed framework employs a two-stage approach: first mapping the environment alongside an operator with a Vision-FM, then using the SENT-Map representation together with a natural-language query within an FM for planning. Our experimental results show that semantic enhancement enables even small, locally deployable FMs to successfully plan over indoor environments.
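As a rough illustration of the JSON representation and the grounding step described above, the following Python sketch loads a small topological map and checks that a plan only visits existing, connected nodes. The schema (`nodes`, `edges`, `room`, `objects`), the node IDs, and the function names are assumptions made for this example and are not taken from the SENT-Map code.

```python
import json

# Hypothetical SENT-Map-style JSON; all field names and node IDs are illustrative.
SENT_MAP_JSON = """
{
  "nodes": {
    "n0": {"room": "hallway", "objects": []},
    "n1": {"room": "kitchen", "objects": ["sink", "sponge"]},
    "n2": {"room": "office", "objects": ["tray", "coffee powder"]}
  },
  "edges": [["n0", "n1"], ["n0", "n2"]]
}
"""


def load_map(text):
    """Parse the JSON text and build an undirected adjacency map."""
    data = json.loads(text)
    adjacency = {node_id: set() for node_id in data["nodes"]}
    for a, b in data["edges"]:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return data["nodes"], adjacency


def plan_is_grounded(plan, nodes, adjacency):
    """Return True if every step is a known node and consecutive steps share an edge."""
    if any(step not in nodes for step in plan):
        return False
    return all(b in adjacency[a] for a, b in zip(plan, plan[1:]))


nodes, adjacency = load_map(SENT_MAP_JSON)
```

Under these assumptions, a plan such as `["n0", "n1"]` passes the check, while a plan that names an unknown node or jumps between unconnected nodes is rejected before it reaches the robot.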

Experiments

Experiments Overview

Get Sponge

The task is an unambiguous reasoning instance where the target object (a sponge) is placed in its logically and semantically appropriate location, i.e., in the kitchen by the sink. This makes it relatively straightforward for LLMs to infer that the kitchen sink is the obvious place to search for the sponge.

Get Coffee

The task introduces a misleading association: the coffee powder is placed on a tray in the office rather than in the lounge or kitchen. This reflects a common real-world scenario where an object is not in a semantically expected location, requiring the agent to rely on a prior mapping phase to identify its placement.

Get Tissue

The task presents a many-to-one mapping scenario, where multiple tables (at least one per location) could plausibly contain a tissue box, but only one actually does. This setting reflects a frequent occurrence in household environments, where several semantically valid locations might exist for a given object. In such cases, an accurate map with the object's current location is essential for efficient task completion.
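The many-to-one situation can be sketched in a few lines of Python: several nodes are plausible by furniture type alone, but only the recorded object list pins down the correct one. The node names, the `furniture` and `objects` fields, and the choice of which table holds the tissue box are hypothetical and only serve the illustration.

```python
# Hypothetical map fragment: three tables, only one of which holds the tissue box.
toy_nodes = {
    "kitchen_table": {"furniture": ["table"], "objects": ["fruit bowl"]},
    "lounge_table": {"furniture": ["table"], "objects": ["tissue box"]},
    "office_table": {"furniture": ["table"], "objects": ["laptop"]},
}


def plausible_nodes(nodes, furniture):
    """Nodes that are semantically plausible because they contain the furniture."""
    return sorted(n for n, info in nodes.items() if furniture in info["furniture"])


def mapped_nodes(nodes, obj):
    """Nodes where the object was actually recorded during mapping."""
    return sorted(n for n, info in nodes.items() if obj in info["objects"])


plausible = plausible_nodes(toy_nodes, "table")
actual = mapped_nodes(toy_nodes, "tissue box")
```

Here `plausible` contains all three tables, whereas `actual` contains only the one node where the object was recorded, which is the information a planner needs to avoid searching every candidate.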

Experimental Results


Various FMs

Model Baseline Semantic Enhancement
Sponge Coffee Tissue Average Sponge Coffee Tissue Average
Gemma 3 27B × 66.7% 100%
Gemini 2.0 Flash 0.0% 100%
Llama 3.1 8B × × 33.3% 100%
Llama 3.1 405B × × 33.3% 100%
GPT 4o mini × × 33.3% 100%
GPT o3 mini × 66.7% 100%
Average 38.9% 100%

Task success across several FMs. A “✓” denotes task success, an “×” denotes task failure, and a “∅” denotes the model’s refusal to output a solution due to requiring additional context. Results indicate that semantic enhancement is critical for FMs to correctly localize objects and plan.
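The averages in the table follow from simple counting over the three tasks. The short Python check below reproduces the baseline numbers; the per-model success counts are inferred from the reported percentages (for example, 66.7% corresponds to 2 of 3 tasks) rather than read from individual table cells.

```python
TASKS_PER_MODEL = 3  # Sponge, Coffee, Tissue

# Baseline successes per model, inferred from the reported baseline averages.
baseline_successes = {
    "Gemma 3 27B": 2,
    "Gemini 2.0 Flash": 0,
    "Llama 3.1 8B": 1,
    "Llama 3.1 405B": 1,
    "GPT 4o mini": 1,
    "GPT o3 mini": 2,
}

# Per-model success rate in percent.
baseline_rates = {
    model: 100.0 * wins / TASKS_PER_MODEL
    for model, wins in baseline_successes.items()
}

# Average across models, matching the table's bottom row.
baseline_average = sum(baseline_rates.values()) / len(baseline_rates)
```

Rounded to one decimal place, `baseline_average` gives 38.9, in agreement with the reported baseline average, compared with 100% for every model under semantic enhancement.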


Indirect Tasks with Gemma 3 27B

Task Baseline Semantic Enhancement
Direct Indirect Direct Indirect
Runny Nose ×
Watch TV × ×
Private Listening × ×
Sanitization × ×
Call a friend × ×
Flavor Coffee ×

Direct-query and indirect-query task success for a small foundation model. Gemma 3 27B was prompted with two types of queries: one directly asking for the object, and one indirectly suggesting the object without naming it. Results indicate that semantic enhancement enables even a small FM to reason about complex tasks.
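To make the difference between the two query types concrete, here is a minimal Python sketch of how a map and a query might be combined into a single planning prompt. The prompt wording, the `build_prompt` helper, the toy map, and both example queries are assumptions for illustration and are not the prompts used in the experiments.

```python
import json


def build_prompt(sent_map, query):
    """Combine a JSON map and a natural-language query into one prompt string."""
    map_text = json.dumps(sent_map, indent=2)
    return (
        "You are planning for an indoor robot.\n"
        "Only use node IDs that appear in the map below.\n\n"
        f"MAP:\n{map_text}\n\n"
        f"REQUEST: {query}\n"
        "Answer with an ordered list of node IDs."
    )


# Hypothetical one-node map used only to show the prompt layout.
toy_map = {
    "nodes": {"n1": {"room": "lounge", "objects": ["tissue box"]}},
    "edges": [],
}

# A direct query names the object; an indirect query only implies it.
direct_prompt = build_prompt(toy_map, "Bring me the tissue box.")
indirect_prompt = build_prompt(toy_map, "I have a runny nose.")
```

In the indirect case the object name appears only in the map text, so the model has to connect the stated need to an object listed in the map before it can plan a route.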


Human Semantic Enhancement with Gemma 3 27B

Task Semantic Enhancement Semantic Enhancement w/ Ownership
Direct Indirect Direct Indirect
Store Bob's Leftovers in the Fridge
Get Bob his drink ×
Bring Bob's things to Alice ×

Here, two people, Bob and Alice, were added to the SENT-Map. Gemma 3 27B was prompted with direct and indirect queries relating to Bob's belongings. We test over two SENT-Map configurations: one with basic Semantic Enhancement, and one with Semantic Enhancement + Ownership tagging. When ownership was not tagged, asking indirectly for objects Bob owns resulted in FM hallucinations, where the model would confidently guess which items were Bob's, while having the tags available allowed the model to plan correctly.

Results suggest that, because FMs are not pre-trained to know a person's belongings, we must guide their reasoning to prevent hallucinations.
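Ownership tagging can be pictured as one extra field on each object entry. The Python sketch below assumes a hypothetical `owner` key and made-up object names; it shows that, without the tag, the map contains no information from which Bob's items could be determined.

```python
def items_owned_by(nodes, person):
    """Return (node_id, object_name) pairs whose 'owner' tag matches the person."""
    found = []
    for node_id, info in nodes.items():
        for obj in info["objects"]:
            # Objects without an 'owner' key are never attributed to anyone.
            if obj.get("owner") == person:
                found.append((node_id, obj["name"]))
    return found


# Hypothetical map fragments, with and without ownership tags.
untagged_nodes = {
    "n1": {"room": "kitchen", "objects": [{"name": "leftovers"}, {"name": "soda can"}]},
}
tagged_nodes = {
    "n1": {
        "room": "kitchen",
        "objects": [
            {"name": "leftovers", "owner": "Bob"},
            {"name": "soda can", "owner": "Alice"},
        ],
    },
}

bobs_items_untagged = items_owned_by(untagged_nodes, "Bob")
bobs_items_tagged = items_owned_by(tagged_nodes, "Bob")
```

With the untagged map the lookup returns an empty list, so any answer about Bob's belongings would have to be a guess; with tags, the answer can be read directly from the map.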


Get Coffee Task

[Figure: Get Coffee task]

Get Tissue Task

[Figure: Get Tissue task]