We introduce SENT-Map, a semantically enhanced topological map for representing indoor environments, designed to support autonomous navigation and manipulation by leveraging advances in foundation models (FMs). By representing the environment in a JSON text format, we enable semantic information to be added and edited in a format that both humans and FMs understand, while grounding the robot to existing nodes during planning to avoid infeasible states during deployment. Our proposed framework employs a two-stage approach: first mapping the environment alongside an operator with a Vision-FM, then using the SENT-Map representation together with a natural-language query within an FM for planning. Our experimental results show that semantic enhancement enables even small, locally deployable FMs to successfully plan over indoor environments.
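To make the idea of a JSON map with node grounding concrete, the following is a minimal sketch rather than the authors' implementation. The schema (the `nodes`, `edges`, `room`, and `objects` fields), the node names, and the helper functions `find_object` and `is_grounded` are all assumptions introduced for illustration; the placement of the tissue box is likewise invented for the example.

```python
import json

# Hypothetical SENT-Map-style document. Field names and node ids are
# illustrative assumptions, not the schema used in the paper.
SENT_MAP_JSON = """
{
  "nodes": {
    "kitchen_sink": {"room": "kitchen", "objects": ["sponge"]},
    "lounge_table": {"room": "lounge", "objects": ["tissue box"]},
    "office_tray": {"room": "office", "objects": ["coffee powder"]}
  },
  "edges": [["kitchen_sink", "lounge_table"], ["lounge_table", "office_tray"]]
}
"""


def load_map(text):
    """Parse the JSON text into a plain dictionary."""
    return json.loads(text)


def find_object(sent_map, name):
    """Return the ids of all nodes whose object list contains `name`."""
    return [nid for nid, node in sent_map["nodes"].items()
            if name in node["objects"]]


def is_grounded(sent_map, plan):
    """Check that a plan only visits existing nodes along existing edges."""
    nodes = sent_map["nodes"]
    # Edges are treated as undirected, so store them as unordered pairs.
    edges = {frozenset(edge) for edge in sent_map["edges"]}
    if any(step not in nodes for step in plan):
        return False
    return all(frozenset((a, b)) in edges for a, b in zip(plan, plan[1:]))


sent_map = load_map(SENT_MAP_JSON)
print(find_object(sent_map, "coffee powder"))
print(is_grounded(sent_map, ["kitchen_sink", "lounge_table", "office_tray"]))
print(is_grounded(sent_map, ["kitchen_sink", "garage"]))
```

Under these assumptions, a plan produced by an FM would be rejected whenever it names a node that is not in the map or jumps between nodes that are not connected, which is one simple way to keep the planner away from infeasible states.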
The task is an unambiguous reasoning instance in which the target object (a sponge) is placed in its logically and semantically appropriate location, i.e., in the kitchen by the sink. This makes it relatively straightforward for LLMs to infer that the kitchen sink is the obvious place to search for the sponge.
The task introduces a misleading association: the coffee powder is placed on a tray in the office rather than in the lounge or kitchen. This reflects a common real-world scenario where an object is not in a semantically expected location, requiring the agent to rely on a prior mapping phase to identify its placement.
The task presents a many-to-one mapping scenario, where multiple tables (at least one per location) could plausibly contain a tissue box, but only one actually does. This setting reflects a frequent occurrence in household environments, where several semantically valid locations might exist for a given object. In such cases, an accurate map with the object's current location is essential for efficient task completion.
Model | Baseline: Sponge | Baseline: Coffee | Baseline: Tissue | Baseline: Average | Semantic Enhancement: Sponge | Semantic Enhancement: Coffee | Semantic Enhancement: Tissue | Semantic Enhancement: Average |
---|---|---|---|---|---|---|---|---|
Gemma 3 27B | ✓ | × | ✓ | 66.7% | ✓ | ✓ | ✓ | 100% |
Gemini 2.0 Flash | ∅ | ∅ | ∅ | 0.0% | ✓ | ✓ | ✓ | 100% |
Llama 3.1 8B | × | × | ✓ | 33.3% | ✓ | ✓ | ✓ | 100% |
Llama 3.1 405B | × | × | ✓ | 33.3% | ✓ | ✓ | ✓ | 100% |
GPT 4o mini | × | ✓ | × | 33.3% | ✓ | ✓ | ✓ | 100% |
GPT o3 mini | ✓ | ✓ | × | 66.7% | ✓ | ✓ | ✓ | 100% |
Average | | | | 38.9% | | | | 100% |
Task success across several FMs. A “✓” denotes task success, an “×” denotes task failure, and a “∅” denotes the model’s refusal to output a solution due to requiring additional context. Results indicate that semantic enhancement is critical for FMs to correctly localize objects and plan.
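The averages in the table can be reproduced with a few lines of arithmetic. The sketch below transcribes the baseline outcomes and counts a refusal (∅) as a failure, which matches the 0.0% reported for Gemini 2.0 Flash; the variable and function names are illustrative.

```python
# Baseline outcomes per model in the order (Sponge, Coffee, Tissue).
# True = success; False = failure or refusal to answer.
baseline = {
    "Gemma 3 27B": [True, False, True],
    "Gemini 2.0 Flash": [False, False, False],
    "Llama 3.1 8B": [False, False, True],
    "Llama 3.1 405B": [False, False, True],
    "GPT 4o mini": [False, True, False],
    "GPT o3 mini": [True, True, False],
}


def success_rate(outcomes):
    """Percentage of successful tasks in a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


per_model = {model: success_rate(o) for model, o in baseline.items()}
overall = sum(per_model.values()) / len(per_model)

for model, rate in per_model.items():
    print(f"{model}: {rate:.1f}%")
print(f"Average: {overall:.1f}%")
```

Seven successes out of eighteen baseline trials give the reported 38.9% average, against 100% for every model with semantic enhancement.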
Task | Baseline: Direct | Baseline: Indirect | Semantic Enhancement: Direct | Semantic Enhancement: Indirect |
---|---|---|---|---|
Runny Nose | ✓ | × | ✓ | ✓ |
Watch TV | × | × | ✓ | ✓ |
Private Listening | × | × | ✓ | ✓ |
Sanitization | × | × | ✓ | ✓ |
Call a friend | × | × | ✓ | ✓ |
Flavor Coffee | ✓ | × | ✓ | ✓ |
Direct-query and indirect-query task success for a small foundation model. Gemma 3 27B was prompted with two types of queries: one directly asking for the object, and one indirectly suggesting the object without naming it. Results indicate that semantic enhancement enables even a small FM to reason about complex tasks.
Task | Semantic Enhancement: Direct | Semantic Enhancement: Indirect | Semantic Enhancement w/ Ownership: Direct | Semantic Enhancement w/ Ownership: Indirect |
---|---|---|---|---|
Store Bob's Leftovers in the Fridge | ✓ | ✓ | ✓ | ✓ |
Get Bob his drink | ✓ | × | ✓ | ✓ |
Bring Bob's things to Alice | ✓ | × | ✓ | ✓ |
Here, two people, Bob and Alice, were added to the SENT-Map. Gemma 3 27B was prompted with direct and indirect queries relating to Bob's belongings. We test over two SENT-Map configurations: one with basic Semantic Enhancement, and one with Semantic Enhancement + Ownership tagging. When ownership was not tagged, asking indirectly for objects Bob owns resulted in FM hallucinations, where the model would confidently guess which items were Bob's, whereas having the tags available allowed the model to plan correctly.
Results suggest that, because FMs are not pre-trained to know a person's belongings, their reasoning must be guided to prevent hallucinations.
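One way to picture ownership tagging is as an explicit field on each object entry, so that a query about a person's belongings is resolved by lookup rather than by guessing. The following is a minimal sketch under that assumption; the `owner` field, the object names, and the `owned_by` helper are hypothetical and are not taken from the paper's map format.

```python
def owned_by(objects, person):
    """Return names of objects explicitly tagged as belonging to `person`.

    Objects without an "owner" field are never attributed to anyone.
    """
    return [obj["name"] for obj in objects if obj.get("owner") == person]


# Hypothetical object entries without ownership tags.
untagged = [
    {"name": "water bottle", "node": "office_desk"},
    {"name": "mug", "node": "lounge_table"},
]

# The same entries with hypothetical ownership tags added.
tagged = [
    {"name": "water bottle", "node": "office_desk", "owner": "Bob"},
    {"name": "mug", "node": "lounge_table", "owner": "Alice"},
]

print(owned_by(untagged, "Bob"))
print(owned_by(tagged, "Bob"))
```

Without tags the lookup returns nothing, which a planner could treat as a cue to ask for clarification instead of guessing; with tags the answer is read directly from the map.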