PiCrawler AI — Gideon Glago

Zeus, front view, with its Raspberry Pi 5 and sensors visible on top — Zeus at rest. SunFounder PiCrawler chassis, Pi 5 + Robot HAT on top, twin ultrasonic sensors up front.

Zeus viewed from above, all four legs visible — Zeus at rest. SunFounder PiCrawler chassis, Pi 5 + Robot HAT on top, twin ultrasonic sensors up front.

Zeus rising onto its legs — Leg movement, audio removed.

Zeus rocking side to side on its legs — Leg movement, audio removed.

Walking gait, audio removed.

01 — Context

Most “AI assistant” products are cloud-tethered: a microphone in a room and a fast pipe to someone else's GPU. That model is fine when you have bandwidth and trust the vendor. It's wrong when neither is reliably true — which describes a lot of places, and describes most edge devices.

PiCrawler AI is a small physical demonstration of the alternative. It's a four-legged spider robot, sold by SunFounder as a hobbyist kit, taught to be an always-on conversational assistant whose entire AI stack — speech recognition, language model, vision model, text-to-speech — runs locally on the Raspberry Pi 5 inside the chassis. No internet, no API keys, no telemetry required. The robot walks, follows faces, listens for a wake word, and talks back, with everything happening on-device. There's an optional OpenAI fallback for when it's online and a sharper answer is worth the round trip, but the default path never leaves the board.

The motivation was partly curiosity (how much of the cloud AI experience could you actually replicate locally on a $200 single-board computer in 2026?), partly skill development (I wanted to learn Vosk, Piper, and the Ollama runtime hands-on), and partly because a robot that turns its head to look at you when you walk into the room is more delightful than any web app I've ever built.

02 — Approach

The robot's behavior is structured as a state machine running across multiple concurrent threads on top of the SunFounder PiCrawler chassis. The hardware is straightforward: a Pi 5, a Robot HAT for servo control, a Pi camera, an ultrasonic distance sensor, and four legs each driven by three servos. The interesting work is the software stack that turns that hardware into something that feels like an aware presence in the room.

                    ┌────────────────────┐
                    │     Pi Camera      │
                    └─────────┬──────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
        ┌──────────┐    ┌──────────┐    ┌──────────┐
        │  Face    │    │ Gesture  │    │ Vision   │
        │ tracking │    │ (vilib)  │    │ (Moon-   │
        │ thread   │    │          │    │  dream)  │
        └────┬─────┘    └────┬─────┘    └────┬─────┘
             │               │               │
             └───────────────┼───────────────┘
                             ▼
                  ┌──────────────────────┐
                  │   STATE MACHINE      │
                  │  IDLE  →  LISTENING  │
                  │   ▲           ▼      │
                  │  SPEAKING ← THINKING │
                  └──────────────────────┘
                             ▲
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐   ┌──────────┐   ┌──────────┐
        │ Vosk STT │   │  Ollama  │   │  Piper   │
        │ wake     │   │ qwen2.5  │   │   TTS    │
        │ word     │   │  1.5B    │   │          │
        └──────────┘   └──────────┘   └──────────┘
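
The state cycle in the diagram can be sketched as a small lock-protected state holder that several threads could share. This is only an illustration: the class name, the transition table, and the single `advance()` method are assumptions, not the project's actual code.

```python
import threading
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# Transitions read off the diagram: one full cycle per conversation turn.
TRANSITIONS = {
    State.IDLE: State.LISTENING,      # wake word heard
    State.LISTENING: State.THINKING,  # utterance captured
    State.THINKING: State.SPEAKING,   # model reply ready
    State.SPEAKING: State.IDLE,       # speech finished
}


class StateMachine:
    """Hypothetical holder for the current state, safe to read from any thread."""

    def __init__(self):
        self._state = State.IDLE
        self._lock = threading.Lock()

    @property
    def state(self):
        with self._lock:
            return self._state

    def advance(self):
        """Move to the next state in the cycle and return it."""
        with self._lock:
            self._state = TRANSITIONS[self._state]
            return self._state
```

A face-tracking or idle-behavior thread could then read `machine.state` and stay out of the way whenever the robot is not IDLE.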

Three concurrent threads keep the robot responsive. A face-tracking loop rotates the body to keep detected faces near the camera center, with a deadzone to prevent jitter. A microphone thread runs Vosk for offline wake-word detection and answers to “computer,” “spider,” “zeus,” or “picrawler.” A behavior loop periodically performs idle actions when nothing else is happening, the kind of small motion that makes the robot feel alive rather than waiting.
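
Two of the decisions described above are simple enough to sketch in isolation: spotting a wake word in a transcript, and applying a deadzone to the face position. The function names, the 640-pixel frame width, and the 15% deadzone are illustrative assumptions. Whether "face right of center" really means "turn right" also depends on how the camera is mounted.

```python
WAKE_WORDS = {"computer", "spider", "zeus", "picrawler"}  # the four words named in the text


def heard_wake_word(transcript: str) -> bool:
    """Return True if any wake word appears as a whole word in the transcript."""
    words = (w.strip(".,!?") for w in transcript.lower().split())
    return any(w in WAKE_WORDS for w in words)


def turn_command(face_x: float, frame_width: int = 640, deadzone: float = 0.15) -> str:
    """Decide how to rotate the body so the face drifts back toward the center.

    face_x is the horizontal pixel position of the detected face.
    Offsets inside the deadzone return 'hold', which is what prevents jitter.
    """
    center = frame_width / 2
    offset = (face_x - center) / center  # -1.0 at the left edge, 1.0 at the right edge
    if abs(offset) <= deadzone:
        return "hold"
    return "right" if offset > 0 else "left"
```

Without the deadzone, a face sitting a few pixels off center would make the robot shuffle left and right indefinitely. With it, small offsets are simply ignored.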

When the wake word fires, the state machine transitions to LISTENING, captures speech via Vosk, optionally grabs a snapshot from the camera, and sends both to a local Ollama process running qwen2.5:1.5b for conversation and moondream:1.8b for vision. The model returns a structured JSON response containing both an action list (movements to perform, such as wave_hand, push_up, or nod) and a spoken answer. The action list is dispatched to the chassis; the spoken answer is streamed sentence by sentence through Piper for TTS, so the robot starts speaking before the model has finished generating. Then the state returns to IDLE.
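
A minimal sketch of the two post-processing steps described above, under stated assumptions: the JSON field names `actions` and `answer` are guesses at the response schema, the allow-list contains only the three actions named in the text, and the answer text is assumed to arrive as a stream of text chunks. The real project may structure all of this differently.

```python
import json
import re

KNOWN_ACTIONS = {"wave_hand", "push_up", "nod"}  # only the actions named in the text


def parse_reply(raw: str):
    """Split a model reply into (actions, answer); field names are assumptions."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Small models sometimes ignore the format; treat the reply as speech only.
        return [], raw.strip()
    if not isinstance(data, dict):
        return [], raw.strip()
    raw_actions = data.get("actions")
    if not isinstance(raw_actions, list):
        raw_actions = []
    # Drop anything the chassis does not know how to perform.
    actions = [a for a in raw_actions if isinstance(a, str) and a in KNOWN_ACTIONS]
    answer = str(data.get("answer", "")).strip()
    return actions, answer


def sentences(chunks):
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # whatever is left when the stream ends
```

Each sentence yielded by `sentences()` could be handed to the TTS engine immediately, which is what lets speech begin while later text is still being generated.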

The single-binary local inference story is genuinely nice. A 1.5-billion-parameter model running on the Pi 5 is slower to answer than a cloud API would be, but fast enough to stay inside the patience window for something you're having a back-and-forth with. Moondream is small enough to run alongside it without thrashing. The whole pipeline lives on the SD card.

03 — Tradeoffs

The hardest part of this project had nothing to do with AI. It was power. The Robot HAT is finicky about power delivery: the Pi will brown out and crash if servo current spikes while the board is underfed, and the failure mode is silent (red LED, dropped SSH session, no useful logs). Getting to a stable always-on configuration meant powering the Pi over USB-C from a wall charger while also keeping the servo battery connected to the Robot HAT, with the Robot HAT switch in the right position. The “right” answer is in the SunFounder docs, but only if you read them more carefully than I did the first time. Zeus now checks its own voltage before and after every move and aborts if the reading looks too low, and there are two kill-switch files for bench testing without the robot thrashing around while you're working on something else.
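
The voltage check and kill-switch idea can be sketched as a wrapper around any movement. Everything specific here is hypothetical: the 6.5 V threshold, the kill-file paths, and the function name are placeholders, and `read_voltage` stands in for whatever battery reading the Robot HAT library provides on the real robot.

```python
from pathlib import Path

MIN_VOLTS = 6.5  # hypothetical threshold; the right value depends on the battery pack
KILL_FILES = (Path("/tmp/zeus_no_move"), Path("/tmp/zeus_no_idle"))  # hypothetical paths


def guarded_move(move, read_voltage, min_volts=MIN_VOLTS, kill_files=KILL_FILES):
    """Run move() only if no kill file exists and the voltage looks healthy.

    move and read_voltage are callables supplied by the caller.
    Returns a short status string describing what happened.
    """
    if any(p.exists() for p in kill_files):
        return "skipped: kill switch present"
    if read_voltage() < min_volts:
        return "aborted: low voltage before move"
    move()
    if read_voltage() < min_volts:
        # The move already ran; the caller should stop queuing further moves.
        return "aborted: voltage sagged during move"
    return "ok"
```

Passing the voltage reader in as a callable keeps the guard testable on a desk without the robot attached, and creating one of the kill files is enough to park the robot during bench work.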

The model choice was a deliberate downscale. Qwen2.5 1.5B is not the best conversational model in the world, but it is small enough to share the Pi's 8GB of RAM comfortably with a vision model and a speech recognizer running at the same time. A bigger model would produce sharper answers but would also start swapping, and a robot that takes thirty seconds to respond stops being a robot and starts being a slightly cute terminal.

The other tradeoff worth flagging is that this is a hobby-grade build, not a product. The chassis is a kit. The software is mine. There's no path from this to a shipped product — SunFounder already sells the chassis, and the on-device AI stack is mostly other people's open-source models behind a glue layer I wrote. That's fine. Not every project needs to be commercializable, and learning Vosk + Piper + Ollama hands-on is a credential I can use elsewhere.

04 — Outcome

The robot works. It tracks faces. It responds when you say “computer,” “spider,” “zeus,” or “picrawler.” It walks. It will, on request, do a small dance, look around, or describe what it sees on its camera. The full software stack is offline-capable: pull the network cable and nothing breaks, because the only thing that ever needed the network was the model downloads, and those are already done.

The biggest win was learning that the local-AI dream is closer to real than the marketing for cloud AI suggests. A 1.5-billion-parameter model running on a $200 board can carry a conversation. That's a useful piece of evidence to have gathered first-hand.

05 — What's next

Nothing major planned — the project hit its goal. The swap I'd flagged earlier, moving to a smaller and faster model for better conversational latency, already happened: qwen2.5:1.5b is what runs today. What's actually queued up is mundane — more demo footage (waving, push-ups, the slightly-too-attentive face-tracking head swivel) rather than new capability. If I picked this back up for real, it'd probably be a second camera for proper depth perception, but that's not worth doing for its own sake. The robot lives on a shelf in my room and waves at me when I walk in.