Semantic Navigation System for Spot
Building an intuitive voice interface that enables Spot to understand natural language commands, navigate semantically labeled environments, and interact conversationally with users.
TLDR
I built “Spotty” - a natural language interface that enables users to control Boston Dynamics’ Spot robot through voice commands while maintaining spatial awareness and semantic understanding. Using a combination of GraphNav, CLIP, GPT-4o, and custom retrieval systems, I created a system that can navigate to locations, search for objects, and answer questions about its environment - all with $5 in API costs and off-the-shelf components. https://github.com/vocdex/SpottyAI
See it in action:
Introduction
In this project, I set out to explore how modern AI systems could enhance human-robot interaction, specifically for Boston Dynamics’ Spot robot. Spot is probably the most famous robot known to the general public, largely due to BD’s impressive demo videos. After seeing their ChatGPT-powered demo, I asked myself: how hard is it to build something similar from scratch? Is there a secret sauce, or is it just excellent engineering?
Here’s the demo that inspired me:
Spoiler alert: After building this system, I found that while there’s no magical secret ingredient, there are numerous non-trivial engineering challenges in making these components work together reliably. The availability of powerful, accessible AI models has dramatically lowered the barrier to creating sophisticated robot interfaces.
System Architecture
The system follows a modular design, ensuring a clear separation of concerns:
- Audio Processing: Detects wake words, transcribes speech, and converts responses to speech via text-to-speech (see the sketch below this list).
- Vision System: Analyzes camera input to generate scene descriptions and recognize objects.
- Navigation: Utilizes Boston Dynamics’ GraphNav for waypoint-based movement, enhanced with semantic annotations.
- Multimodal Retrieval-Augmented Generation (RAG): Stores and retrieves contextual information about locations and objects to improve responses and decision-making.

Since I don’t have an actual arm for our Spot (it costs an arm and a leg), I focused exclusively on navigation and perception tasks.
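To make the audio component concrete, here’s a minimal sketch of the speech round-trip using OpenAI’s Whisper transcription and text-to-speech endpoints. Wake-word detection and microphone recording are omitted, and the model and voice choices are illustrative assumptions rather than Spotty’s exact setup:

```python
# A minimal sketch of the speech round-trip (wake-word detection and mic recording omitted).
# Model and voice choices are illustrative; playback on the robot's speaker is up to you.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def transcribe(wav_path: str) -> str:
    """Speech-to-text for a recorded voice command."""
    with open(wav_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def speak(phrase: str, out_path: str = "reply.mp3") -> str:
    """Text-to-speech for Spot's spoken replies."""
    response = client.audio.speech.create(model="tts-1", voice="alloy", input=phrase)
    response.stream_to_file(out_path)  # play this file through the robot's speaker
    return out_path
```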
A high-level representation of our system’s workflow:
Enhancing Spatial Awareness
GraphNav Service: The Solid Foundation
Spot’s SDK includes GraphNav, which provides reliable localization and path planning. GraphNav already makes life easier for us since we don’t have to set up a third-party SLAM package, which would probably require ROS integration. Spot’s gRPC-based communication and interface are super reliable.
However, GraphNav doesn’t inherently associate waypoints with meaningful labels or semantic understanding - it’s essentially just a graph of positions and connections.
GraphNav builds a locally consistent graph of nodes (waypoints) and edges (paths), where:
- Each node represents a location where the robot can navigate
- Each edge contains parameters for navigating between nodes
- The system plans paths by chaining together these connections
Each waypoint consists of:
- a reference frame
- a unique ID
- annotations
- sensor data
What’s really interesting for us are the snapshots recorded at each waypoint. GraphNav waypoint snapshots can include AprilTag detections, images, terrain maps, point clouds, etc. (more in the Spot documentation).
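For reference, a recorded GraphNav map can be inspected offline to get at those snapshots. The sketch below assumes the directory layout produced by the Spot SDK’s map-recording examples (a graph file plus a waypoint_snapshots folder); the path handling is illustrative:

```python
# Sketch: loading a recorded GraphNav map from disk to inspect waypoint snapshots.
# Assumes the layout written by the SDK's recording examples:
#   <map_dir>/graph, <map_dir>/waypoint_snapshots/<snapshot_id>
import os
from bosdyn.api.graph_nav import map_pb2

def load_map(map_dir: str):
    # The topological graph: waypoints (nodes) and edges (paths).
    with open(os.path.join(map_dir, "graph"), "rb") as f:
        graph = map_pb2.Graph()
        graph.ParseFromString(f.read())

    # Each waypoint references a snapshot containing images, fiducials, etc.
    snapshots = {}
    for waypoint in graph.waypoints:
        snapshot_path = os.path.join(map_dir, "waypoint_snapshots", waypoint.snapshot_id)
        with open(snapshot_path, "rb") as f:
            snapshot = map_pb2.WaypointSnapshot()
            snapshot.ParseFromString(f.read())
        snapshots[waypoint.id] = snapshot  # snapshot.images holds the camera data

    return graph, snapshots
```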
Why this matters: Without semantic understanding, Spot’s GraphNav can’t comprehend commands like “go to the kitchen” or “find the coffee machine.” It only knows coordinates and paths. We can leverage the images recorded in snapshots to empower Spot with such semantic understanding of the environment.
Making Spot Understand Spaces: Three-Level Annotation
To give Spot semantic understanding of its environment, I built a three-level annotation system:
- CLIP-based location labeling: using the CLIP vision-language model to classify the snapshots at each waypoint as belonging to one of a set of predefined locations (“kitchen”, “hallway”, “office”, “robot lab”, etc.)
- GPT-4o-mini scene understanding: for each waypoint, the vision model analyzes what it sees from the front cameras, identifying objects and generating short scene descriptions useful for the robot.
- Manual annotation updates: since both CLIP and GPT can fail in some scenarios (mentioned below), I added a manual way to update waypoint annotations.
1. CLIP-based Location Labeling
CLIP is a jack-of-all-trades contrastive vision-language model originally released in 2021. It enabled impressive zero-shot transfer on many downstream tasks such as classification, OCR, and video action recognition. In (simplified) Python, the waypoint labeling works like this:
from collections import Counter

location_labels = ["kitchen", "office", "hallway", ...]
text_embeddings = CLIP_encode_text(location_labels)

def classify_waypoint(waypoint_snapshot):
    # Majority vote over all valid camera images stored in the waypoint snapshot
    votes = Counter()
    for image in waypoint_snapshot.images:
        if valid_camera_image(image):
            image_embedding = CLIP_encode_image(process_image(image))
            similarities = cosine_similarity(image_embedding, text_embeddings)
            best_match = location_labels[argmax(similarities)]
            votes[best_match] += 1
    return votes.most_common(1)[0][0] if votes else "unknown"
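The CLIP_encode_text / CLIP_encode_image placeholders above can be filled in with, for example, the Hugging Face transformers CLIP checkpoint; the model choice and prompt template below are my illustrative assumptions, not necessarily Spotty’s exact configuration:

```python
# Sketch of the CLIP encoders used in the pseudocode above (Hugging Face transformers).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def CLIP_encode_text(labels):
    # A simple prompt template tends to help zero-shot classification.
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

def CLIP_encode_image(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# With normalized embeddings, cosine similarity is just a matrix product:
# similarities = (image_embedding @ text_embeddings.T).squeeze(0)
```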
Challenge encountered: Our lab environment had visually similar spaces (white walls everywhere!) that confused CLIP. To address this, I implemented neighbor-based validation:
# Neighbor-based label validation
def validate_label(waypoint_id, predicted_label, confidence):
    # If prediction confidence is high, trust it
    if confidence >= CONFIDENCE_THRESHOLD:
        return predicted_label

    # Get connected waypoints from the graph
    neighbors = get_connected_waypoints(waypoint_id)
    if not neighbors:
        return predicted_label

    # Get labels of neighboring waypoints (skip unlabeled ones)
    neighbor_labels = [labels[n] for n in neighbors if n in labels]
    if not neighbor_labels:
        return predicted_label

    # Count label occurrences and find the most common one among neighbors
    label_counts = Counter(neighbor_labels)
    majority_label, majority_count = label_counts.most_common(1)[0]
    majority_ratio = majority_count / len(neighbor_labels)

    # If neighbors strongly agree on a different label, use it instead
    if majority_ratio >= NEIGHBOR_THRESHOLD and majority_label != predicted_label:
        log("Correcting waypoint label based on neighbors")
        return majority_label

    return predicted_label
This voting mechanism provided impressive improvements in categorization accuracy, especially in ambiguous transition areas like doorways.
2. GPT-4o Visual Scene Understanding
For each waypoint, the GPT-4o vision model analyzed front camera images to:
- Identify visible objects
- Generate scene descriptions
- Anticipate potential questions about the scene
I use Pydantic to enforce consistent formatting:
from typing import List
from pydantic import BaseModel, Field

class WaypointAnnotation(BaseModel):
    visible_objects: List[str]
    hypothetical_questions: List[str] = Field(description="Hypothetical questions about the scene that could be asked")
    scene_description: str = Field(description="A brief description of the scene for the robot")
Pro tip: Using structured outputs with LLMs dramatically improved response consistency and reduced post-processing headaches. If you are new to this, I recommend checking out a short course on LLM engineering, such as Structured Outputs by Weights & Biases, or just open up the OpenAI Docs.
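As an illustration, with the OpenAI Python SDK the Pydantic model can be passed directly as the response format. The sketch below assumes waypoint images are sent as base64 data URLs; image_to_data_url is a hypothetical helper, not part of the SDK:

```python
# Sketch: asking the vision model for a structured WaypointAnnotation.
# `image_to_data_url()` is a hypothetical helper that base64-encodes a camera image.
from openai import OpenAI

client = OpenAI()

def annotate_waypoint(image) -> WaypointAnnotation:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Describe what the robot sees at this waypoint."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(image)}},
            ]},
        ],
        response_format=WaypointAnnotation,  # the Pydantic model defined above
    )
    return completion.choices[0].message.parsed
```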
3. Manual Overrides
For edge cases where automated systems failed, I added a simple interface to manually update annotations. More on this later.
Efficient Retrieval with FAISS
To make retrieval lightning-fast, I use FAISS for vector search:
- Camera images at each waypoint are processed by GPT-4o
- Annotations are stored in a vector database
- User queries are embedded and compared to stored annotations
- The system retrieves the most relevant waypoints and objects
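A stripped-down version of that retrieval path could look like the sketch below. The embedding model is an assumption, and annotations stands in for a dictionary mapping waypoint IDs to the WaypointAnnotation objects described earlier:

```python
# Sketch: building a FAISS index over waypoint annotations and querying it.
# annotations: dict mapping waypoint_id -> WaypointAnnotation (built in the previous step)
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype="float32")
    faiss.normalize_L2(vecs)  # normalized, so inner product == cosine similarity
    return vecs

# One text blob per waypoint: scene description plus visible objects.
waypoint_ids = list(annotations.keys())
corpus = [
    annotations[w].scene_description + " " + ", ".join(annotations[w].visible_objects)
    for w in waypoint_ids
]

vectors = embed(corpus)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def search_waypoints(query, k=3):
    """Return the k most relevant (waypoint_id, score) pairs for a user query."""
    scores, idxs = index.search(embed([query]), k)
    return [(waypoint_ids[i], float(s)) for i, s in zip(idxs[0], scores[0])]
```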
Example interaction:
User: "Can you take me to where the coffee maker is?"
System: [Internally searches for "coffee maker" in vector database]
System: [Finds matches in kitchen and break room]
Spot: "I found a coffee maker in two places. Would you prefer the one in the kitchen or the break room?"
User: "Kitchen please."
System: [Navigates to kitchen waypoint with highest coffee maker relevance]
Function Calling with Voice Commands
The heart of the system is the LLM-based function dispatcher that interprets natural language and turns it into robot actions.
Implementation
The key to making this work was providing a precise system prompt for the language model. Moreover, what’s really cool about language models is that they can imitate the different personas present in their training data. I had some fun asking it to respond like the late comedian-actor Robin Williams, and it came up with really good jokes! Here’s the compact version:
system_prompt_assistant = """
You are controlling Spot, a quadruped robot with navigation, speech capabilities, and memory of conversations. You have access to the following API:
Add a bit of humor to your responses to make the conversation more engaging in style of Robin Williams.
Always reply with a single function call.
1. Navigation & Search:
- navigate_to(waypoint_id, phrase): Move to location while speaking
- search(query): Search environment using scene understanding
2. Interaction:
- say(phrase): Speak using text-to-speech
- ask(question): Ask question and process response
3. Visual Understanding:
- describe_scene(query): Analyze current camera views
4. Control your stance:
- sit()
- stand()
Possible locations: kitchen, office, hallway, study room, robot lab, base station
"""
Real interaction example:
User: "Hey Spot, can you find me a coffee mug?"
Spot: [Internal function call: search("coffee mug")]
Spot: "I'll hunt down that elusive coffee mug faster than Robin Williams could change characters! Let me check around..."
[Spot navigates to kitchen waypoint where mugs were detected]
Spot: "Nano nano! Found some mugs over here in the kitchen. Ready for your caffeine adventure, Captain?"
Context Tracking
To give Spot short-term memory, the conversation agent keeps track of the last 10 conversation turns. This served two critical purposes:
- State awareness: Tracking Spot’s current location and previous actions
- Conversation coherence: Allowing reference to previous topics and user preferences
This made interactions feel much more natural and enabled multi-turn tasks like “Go to the kitchen, then find me a mug.”
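The memory itself can be as simple as a bounded deque of chat messages that is passed back into every dispatcher call; a minimal sketch:

```python
# Sketch: rolling conversation memory of the last 10 turns (20 messages).
from collections import deque

history = deque(maxlen=20)  # one user + one assistant message per turn

def remember(user_text, assistant_text):
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})

# Each new request is then dispatched with the accumulated context:
# dispatch(user_text, list(history))
```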
Results
Here’s what our recorded map structure looks like (without labels):

When searching for a “coffee maker” in the map, the system highlights the corresponding waypoint node in red:

The front camera images at that waypoint:

It’s remarkable that GPT-4o could reliably detect objects in these low-quality grayscale images - far better than I expected :)
What Worked Well
- Voice command interpretation: 90%+ of natural language commands were correctly mapped to functions
- Object search: Spot successfully found objects like “coffee maker,” “fire extinguisher,” and “desk chair”
- Navigation: Location-based commands like “go to the kitchen” worked reliably
- Conversation coherence: Spot maintained context across multiple turns
Challenges and Solutions
- GraphNav localization failures
  - Problem: Occasionally, Spot would lose track of its position; there is probably more tuning to be done on the GraphNav side
  - Solution: Added more AprilTag fiducials in feature-poor areas, especially around corners
- Navigation precision issues
  - Problem: Humans and robots have different expectations about what being “at” a location means
  - Solution: Added “approach points” to get Spot closer to objects of interest
- Function calling errors
  - Problem: Occasional misinterpretation of complex requests
  - Solution: Added more examples to the system prompt and implemented request reformulation
Conclusion
This project demonstrated that building a conversational robot interface is now within reach of individual developers, thanks to the accessibility of powerful AI models. The entire development cost me just $5 in API calls - astonishing considering what would have been required even two years ago.
So, is there a “secret sauce” in Boston Dynamics’ demos? Not exactly - but there is expert engineering to handle countless edge cases and ensure reliability. The fundamental components are now widely available, but making them work seamlessly together remains challenging.
The most surprising discovery was how effectively GPT-4o could understand and reason about low-quality robot camera images. This suggests we’re entering an era where robots can have much more sophisticated environmental understanding without specialized hardware.
Future Directions
If I were to continue this project, I’d focus on:
- Adding dynamic map updating as environments change
- Implementing multi-task planning for complex requests
- Incorporating more physical interaction capabilities (when I can afford that arm!)
I hope this project inspires others to experiment with accessible robot interfaces. The future of human-robot interaction doesn’t need to be locked behind corporate resources - it’s becoming increasingly democratized.
Thanks for reading! If you’ve made it this far, I appreciate your attention and would love to hear your thoughts.
Acknowledgements
- Big thanks to the Automatic Control Chair at FAU for providing the robot for my semester project
- Many thanks to Boston Dynamics’ engineers for their work on Spot SDK
- HuggingFace, OpenAI, Facebook Research