Voice Agent

Voice Agent is a conversational AI proof of concept, built to production standards, that demonstrates a sub-1.2-second round trip from end of utterance to first audio byte. That is the latency budget that separates a natural-feeling agent from one that feels broken. The platform combines LiveKit Cloud (WebRTC SFU + Python agent worker), a LangGraph state machine for conversation flow, FastMCP for tool exposure, Sarvam for Indian-language speech, Groq for English speech, and OpenAI for language understanding. PostgreSQL persists cross-call session memory; Redis holds live in-session state. The architecture follows the latency-budget, silence-handling, and conversation-design patterns from the Building Intelligent Voice Agents guide.

Detail-page tour · LiveKit + LangGraph + multi-language voice stack

Round-trip latency: <1.2s
Max silence: <1.1s
Languages: multi-language (English + Indian languages)
Audio transport: WebRTC

What you can do

Core capabilities

Hands-on features available when you launch the live demo.

Sub-1.2s round-trip latency

End-to-end pipeline tuned to keep total time from end of caller's utterance to first audio byte of the agent's response under the 1.2-second budget that separates natural conversation from a broken-feeling agent.
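
One way to keep a budget like this honest is to split it into per-stage allowances and check measurements against them. The sketch below is illustrative only: the source states the 1.2-second total, not the breakdown, so the stage names and millisecond values are assumptions, as are the helper names `total_budget` and `over_budget`.

```python
from dataclasses import dataclass

BUDGET_MS = 1200  # end of utterance to first audio byte, as stated in the text


@dataclass(frozen=True)
class Stage:
    name: str
    budget_ms: int  # hypothetical allowance, not a measured value


# Assumed split of the total; the source does not publish a breakdown.
STAGES = [
    Stage("endpointing", 200),
    Stage("stt_final", 200),
    Stage("llm_first_token", 450),
    Stage("tts_first_byte", 250),
    Stage("network", 100),
]


def total_budget(stages):
    """Sum the per-stage allowances in milliseconds."""
    return sum(s.budget_ms for s in stages)


def over_budget(measured_ms, stages=STAGES):
    """Return the names of stages whose measured time exceeds their allowance."""
    allowed = {s.name: s.budget_ms for s in stages}
    return [name for name, ms in measured_ms.items() if ms > allowed.get(name, 0)]
```

A per-turn check such as `over_budget({"llm_first_token": 600, "stt_final": 150})` would point at the language-model stage as the one to tune.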

Filler-phrase handler against silence

When the language model is slow, a brief phrase like 'let me check on that for you' fills the gap without breaking flow. The agent never goes silent for more than 1.1 seconds — silence in a phone call is an error signal, not a neutral state.
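
The filler behavior described above can be sketched as a timeout race: wait for the model for a bounded time, speak a filler if it has not answered, then speak the real reply. This is a minimal sketch, not the demo's implementation. `respond_with_filler`, the `speak` callback, and the 0.8-second trigger (chosen here to leave headroom under the 1.1-second ceiling) are all assumptions.

```python
import asyncio

FILLER_AFTER_S = 0.8  # assumed trigger, kept below the 1.1 s silence ceiling
FILLER_TEXT = "Let me check on that for you."


async def respond_with_filler(llm_call, speak, filler_after_s=FILLER_AFTER_S):
    """Await llm_call; if it is still running at the deadline, speak a filler first."""
    task = asyncio.ensure_future(llm_call)
    # asyncio.wait does not cancel the task on timeout, so the reply is not lost.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await speak(FILLER_TEXT)
    reply = await task
    await speak(reply)
    return reply
```

Because the model call keeps running while the filler plays, the filler costs no extra latency on the real answer.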

LangGraph conversation state machine

Multi-turn dialog managed as an explicit DAG with named nodes (greet, gather, confirm, fulfill, recover) instead of free-form prompts — predictable, testable, and easier to evolve as flows grow.
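
To show why named nodes make a flow testable, here is a library-free sketch of the five nodes as a transition table. It deliberately does not use the LangGraph API. The source names only the nodes, so the edges, the `ok`/`error` outcome labels, and the terminal `done` state are assumptions.

```python
# Assumed edges between the nodes named in the text; "done" is a hypothetical
# terminal state added for this sketch.
TRANSITIONS = {
    "greet": {"ok": "gather", "error": "recover"},
    "gather": {"ok": "confirm", "error": "recover"},
    "confirm": {"ok": "fulfill", "error": "recover"},
    "fulfill": {"ok": "done", "error": "recover"},
    "recover": {"ok": "done"},
}


def next_node(current, outcome):
    """Look up the next node, failing loudly on an undefined transition."""
    try:
        return TRANSITIONS[current][outcome]
    except KeyError:
        raise ValueError(f"no transition from {current!r} on {outcome!r}") from None


def run(outcomes, start="greet"):
    """Replay a list of outcomes and return the path of visited nodes."""
    path = [start]
    for outcome in outcomes:
        if path[-1] == "done":
            break
        path.append(next_node(path[-1], outcome))
    return path
```

With the flow written as data, a unit test can assert that a failure in `gather` always lands in `recover` without calling any model.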

FastMCP tool exposure

External tools (lookups, bookings, escalations) are surfaced through FastMCP so the agent can call them mid-conversation with bounded latency and structured arguments.
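
"Bounded latency and structured arguments" can be illustrated with a typed argument object and a deadline around the call. The sketch below uses only the standard library and is not the FastMCP API. `LookupArgs`, `lookup_order`, `call_tool`, and the 0.5-second default deadline are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupArgs:
    """Hypothetical structured arguments for a lookup tool."""
    order_id: str


async def lookup_order(args: LookupArgs) -> dict:
    """Hypothetical tool; the sleep stands in for a backend round trip."""
    await asyncio.sleep(0.01)
    return {"order_id": args.order_id, "status": "shipped"}


async def call_tool(tool, args, timeout_s=0.5):
    """Run a tool under a deadline and return a structured result either way."""
    try:
        result = await asyncio.wait_for(tool(args), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        # The caller can apologize or retry instead of leaving the line silent.
        return {"ok": False, "error": "timeout"}
```

Returning a result object on timeout, rather than raising, lets the conversation flow decide how to recover within the same turn.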

Multi-language speech (Sarvam + Groq)

Sarvam handles Indian-language speech-to-text and text-to-speech (Hindi, Telugu, Tamil, etc.); Groq handles English speech-to-text with sub-200ms first-token latency. Voice routing is transparent to the language model.
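
The routing rule above amounts to a lookup on the caller's language. A minimal sketch, assuming BCP 47 style language tags such as `hi-IN`; the function name `pick_stt_provider` is hypothetical, and the language set lists only the three languages the text names, leaving its "etc." open.

```python
# Only the languages named in the text; the source's "etc." is not filled in.
INDIAN_LANGS = {"hi", "te", "ta"}  # Hindi, Telugu, Tamil


def pick_stt_provider(lang_code):
    """Map a language tag such as 'en-US' or 'hi-IN' to a speech-to-text provider."""
    base = lang_code.lower().split("-")[0]
    if base == "en":
        return "groq"
    if base in INDIAN_LANGS:
        return "sarvam"
    raise ValueError(f"no speech provider configured for {lang_code!r}")
```

Since the language model receives only text, a change to this mapping does not touch prompts or conversation logic.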

OpenAI for language understanding

Conversation reasoning, intent extraction, and tool-argument synthesis run through OpenAI; the model sees only text after STT, so the language layer is decoupled from the audio layer.

Persistent + live session memory

PostgreSQL stores cross-call memory (preferences, history); Redis holds live in-session state (current turn, pending tool calls, partial transcripts) so reconnections after a network blip resume cleanly.
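
Resuming after a reconnect comes down to writing the live state under a per-session key on every turn and reading it back when the client rejoins. The sketch below uses a dict-backed stand-in for Redis so it runs without a server. The `LiveStateStore` class, the key layout, and the state fields are assumptions based on the items the text lists.

```python
import json


class LiveStateStore:
    """Dict-backed stand-in for Redis: string keys, JSON string values."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def session_key(session_id):
    return f"session:{session_id}:state"  # hypothetical key layout


def save_state(store, session_id, state):
    """Serialize the live state and write it under the session's key."""
    store.set(session_key(session_id), json.dumps(state))


def resume_state(store, session_id):
    """Load the saved state, or start fresh if the session is unknown."""
    raw = store.get(session_key(session_id))
    if raw is None:
        return {"turn": 0, "pending_tool_calls": [], "partial_transcript": ""}
    return json.loads(raw)
```

A reconnecting worker would call `resume_state` first and continue from the stored turn rather than greeting the caller again.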

LiveKit Cloud audio transport

WebRTC SFU handles bidirectional low-latency audio. The Python agent worker auto-joins each room created by the front-end token-issuer; no audio ever transits our application server.

Under the hood

Technology stack

Every layer of the stack, from audio transport to front-end UI.

Technology | Role & contribution
LiveKit Cloud (Agents SDK + SFU) | WebRTC audio transport + Python agent worker that auto-joins rooms
LangGraph | Stateful conversation DAG (greet → gather → confirm → fulfill → recover)
FastMCP | Tool exposure surface: lookup, booking, escalation calls during a turn
Sarvam | Indian-language speech-to-text + text-to-speech (Hindi, Telugu, Tamil, etc.)
Groq | English speech-to-text with sub-200ms first-token latency
OpenAI | Language understanding, intent extraction, tool-argument synthesis
PostgreSQL | Persistent cross-call memory (preferences, history, audit log)
Redis | Live in-session state (current turn, pending tool calls, partial transcripts)
Next.js 15 + LiveKit React SDK | Front-end UI + server-side LiveKit access-token issuer

Ready to try it

Launch the live Voice Agent demo

Runs in your browser.
