
# How We Built a Voice-First Enterprise Agent with the Gemini Live API
---
## The Problem Nobody Talks About
Here's a stat that haunts us: **90% of people use only 10% of a complex platform's capabilities.** We've spent a decade building [TillTech](https://till.tech), a comprehensive business management platform for hospitality — EPOS, delivery logistics, kitchen displays, marketing automation, staff management, the works. Powerful tools, all connected. And yet, the operators running busy restaurants at 6pm on a Friday don't have time to click through nested menus to find the stock reorder button.
The tools exist. The data is there. But the interface gets in the way.
When we saw the Gemini Live Agent Challenge, we knew immediately what we wanted to build: **what if the entire interface was just... a conversation?**
## Meet Tilly
Tilly Live Ops is a pure-voice enterprise orchestrator. No keyboard. No mouse. You talk to Tilly, and she acts.
Ask her for a driver update — she pulls live data and populates a visual dashboard. Tell her to send a customer apology — she fires the SMS and logs it. Say "let's draft an email campaign, 20% off fish and chips" — she generates a branded email preview complete with an **AI-generated image**, right there on screen, while you keep talking.
The magic is in the "while you keep talking" part.
## The Layered Architecture
The biggest technical challenge was a classic concurrency problem: **how do you make a real-time audio stream pause, think, generate structured JSON for tool calls, and resume — without the user noticing?**
The answer: you don't. You separate the layers.
### Layer 1: The Voice Stream
We used **Gemini 2.5 Flash with native audio** (`gemini-2.5-flash-native-audio-preview-12-2025`) over WebSockets for real-time, bidirectional audio. This is Tilly's mouth and ears. The operator speaks naturally, Tilly responds with a warm, contextual voice — no text-to-speech, no robotic pauses. The model handles the conversation, and can trigger function calls (tool use) directly during the audio stream.
```typescript
import { GoogleGenAI, Modality } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const session = await ai.live.connect({
  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
  config: {
    responseModalities: [Modality.AUDIO],
    speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Aoede' } } },
    tools: [{ functionDeclarations: [/* ... */] }],
  },
  // Server messages (audio chunks, tool calls) arrive via callbacks.
  callbacks: { onmessage: (message) => handleServerMessage(message) },
});
```
### Layer 2: The Action Engine
Simultaneously, a background text model handles structured planning when additional intelligence is needed. When the audio model's function calling doesn't fire (it's a preview model — it's not always reliable), our **3-tier SmartPlan** system catches it:
1. **Tier 1** — The audio model calls a tool directly via `function_call`. Instant.
2. **Tier 2** — We scan Tilly's spoken output for confirmation phrases ("I've drafted that email", "SMS sent"). When detected, we extract dynamic data and trigger the action.
3. **Tier 3** — We match the operator's keywords as a last-resort fallback. Info queries fire immediately; action queries wait for confirmation.
This layered approach means **something always catches the intent**, even when the bleeding-edge model has an off moment.
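The fallback tiers can be sketched as a small detector. This is illustrative TypeScript, not our production code: the phrase lists, action names, and `detectIntent` are stand-ins.

```typescript
type Intent = { action: string; params: Record<string, string> };

// Tier 2: confirmation phrases in Tilly's own spoken output.
const CONFIRMATIONS: Array<{ pattern: RegExp; action: string }> = [
  { pattern: /i'?ve drafted that email/i, action: 'draft_email' },
  { pattern: /sms sent/i, action: 'send_sms' },
];

// Tier 3: last-resort keyword match on the operator's words.
// Info queries fire immediately; action queries wait for confirmation.
const KEYWORDS: Array<{ pattern: RegExp; action: string; needsConfirm: boolean }> = [
  { pattern: /driver update/i, action: 'get_driver_status', needsConfirm: false },
  { pattern: /apology/i, action: 'send_customer_sms', needsConfirm: true },
];

function detectIntent(tillySpeech: string, operatorSpeech: string): Intent | null {
  for (const c of CONFIRMATIONS) {
    if (c.pattern.test(tillySpeech)) return { action: c.action, params: {} };
  }
  for (const k of KEYWORDS) {
    // Confirmation-gated actions are handed to a separate flow (not shown).
    if (k.pattern.test(operatorSpeech) && !k.needsConfirm) {
      return { action: k.action, params: {} };
    }
  }
  return null; // Tier 1 (a native function_call) is handled upstream.
}
```

Tier 1 stays the happy path; the detector only runs when no `function_call` arrived with the model's turn.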
### Layer 3: The Image Generator
For email campaigns, we use **Gemini 3.1 Flash Image** (`gemini-3.1-flash-image-preview`) to generate branded campaign visuals on the fly. The operator describes the promotion in natural speech; Tilly generates a complete mobile email preview — header, hero image, CTA button, the lot — in seconds.
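In code this is one `generateContent` call plus a scan of the returned parts for inline image data. A sketch under assumptions: `generateHeroImage`, the prompt wording, and the loosely typed `ai` client parameter are illustrative, not our real helpers.

```typescript
type Part = { text?: string; inlineData?: { data?: string } };

// Pull base64-encoded image bytes out of a generateContent response's parts.
function extractImageBytes(parts: Part[]): Buffer | null {
  for (const part of parts) {
    if (part.inlineData?.data) return Buffer.from(part.inlineData.data, 'base64');
  }
  return null;
}

// `ai` is a GoogleGenAI client from @google/genai, created elsewhere.
async function generateHeroImage(ai: any, promotion: string): Promise<Buffer | null> {
  const response = await ai.models.generateContent({
    model: 'gemini-3.1-flash-image-preview',
    contents: `Branded hero image for a restaurant email campaign: ${promotion}`,
  });
  return extractImageBytes(response.candidates?.[0]?.content?.parts ?? []);
}
```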
## The Stack
Everything runs on Google Cloud with a deliberately lean stack:
| Component | Technology |
|-----------|-----------|
| **Voice** | Gemini Live API (native audio over WebSockets) |
| **Planning** | Gemini 3.1 Flash (structured JSON output) |
| **Images** | Gemini 3.1 Flash Image (multimodal generation) |
| **Backend** | Node.js with native `node:http` — no frameworks |
| **Frontend** | React 19 + Vite |
| **SDK** | `@google/genai` v1.45 |
| **Deployment** | Google Cloud Run via Terraform |
| **Only dependency** | `@google/genai` — that's literally it on the server |
Yes, you read that right. The server's `package.json` has exactly **one production dependency**: the Google GenAI SDK. Everything else is Node.js built-ins. We wanted to prove that you don't need a framework zoo to build something sophisticated — you need a good model and clean architecture.
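For flavour, a server in this style stays tiny. A minimal sketch — the `/healthz` route is illustrative, not a real TillTech endpoint:

```typescript
import { createServer } from 'node:http';

// Plain node:http routing: a pure function from URL to response.
function route(url: string | undefined): { status: number; body: string } {
  if (url === '/healthz') return { status: 200, body: JSON.stringify({ ok: true }) };
  return { status: 404, body: JSON.stringify({ error: 'not found' }) };
}

const server = createServer((req, res) => {
  const { status, body } = route(req.url);
  res.writeHead(status, { 'content-type': 'application/json' });
  res.end(body);
});

// Cloud Run injects PORT; fall back to 8080 locally.
server.listen(Number(process.env.PORT ?? 8080));
```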
## The UI: Keyboard-Less, Not Screen-Less
One thing we're particularly proud of is the UI. Tilly is voice-first, but she's not voice-only. The screen is a **live operations canvas**:
- **The Orb** — a central pulsing visualisation that shows Tilly's state (listening, speaking, processing, acting)
- **Dashboard Panels** — six operational domains (Drivers, Inventory, Kitchen, Marketing, Customers, Staff) that populate with live data as you talk
- **Viz Cards** — animated centre-stage cards that show actions in progress, complete with data, status indicators, and AI-generated images
- **Action Timeline** — a timestamped log of every action taken during the session
The key insight: by implementing an event-driven state machine with optimistic UI placeholders, the interface updates seamlessly alongside the voice stream. Cards animate in while Tilly is still talking. Data populates while you're already asking the next question. It proves that a keyboard-less app can still provide deep visual context without the clutter.
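That state machine boils down to a reducer over UI events. A simplified sketch — the event names and `VizCard` shape are illustrative:

```typescript
type CardStatus = 'pending' | 'in_progress' | 'done';
type VizCard = { id: string; action: string; status: CardStatus; data?: unknown };

type CardEvent =
  | { type: 'intent_detected'; id: string; action: string } // optimistic placeholder
  | { type: 'action_update'; id: string; data: unknown }    // live data arrives
  | { type: 'action_complete'; id: string };                // logged to the timeline

function reduceCards(cards: VizCard[], event: CardEvent): VizCard[] {
  switch (event.type) {
    case 'intent_detected':
      // The card animates in immediately, before any backend data exists.
      return [...cards, { id: event.id, action: event.action, status: 'pending' }];
    case 'action_update':
      return cards.map((c) =>
        c.id === event.id ? { ...c, status: 'in_progress', data: event.data } : c
      );
    case 'action_complete':
      return cards.map((c) => (c.id === event.id ? { ...c, status: 'done' } : c));
  }
}
```

Because every update is a pure state transition, the voice stream and the UI never block each other; React simply re-renders whatever the latest card list says.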
## What We Learned
**Plan for model inconsistency.** The Gemini Live native audio model is incredible, but it's a preview. Function calls don't always fire. Audio streams can drop. Our 3-tier detection system was born from necessity — and it turned out to be a genuinely robust architectural pattern.
**Separate your streams.** Trying to make one model do everything — conversation, tool calling, image generation, structured output — is fragile. Separating the voice layer from the action layer from the image layer gave us reliability where a single-model approach would have been brittle.
**Voice changes everything.** When you remove the keyboard, you remove the learning curve. Features that operators never found in menu hierarchies suddenly become accessible when they can just say "check the rotas" or "halt garlic bread prep." The platform becomes immediately useful to the other 90%.
## What's Next
Tilly is actually a slice of a much deeper stack — an orchestrator-based system with field-specific experts and dedicated action sub-agents. We've already built out features like a full email designer, which Tilly drives to build campaigns from scratch, component by component. The same approach works for brand websites, mobile apps, self-service kiosks, kitchen displays, warehouse systems, and more.
For this hackathon, we wanted to showcase how you can take a conversational orchestrator approach and use it to take meaningful, complex actions within your business. How deep that rabbit hole goes is up to you.
---
*Built with the Gemini Live API for the Gemini Live Agent Challenge*
#GeminiLiveAgentChallenge