Voice Agent
Voice Agent is a production-grade conversational AI proof of concept that demonstrates a sub-1.2-second round-trip from end-of-utterance to first audio byte — the latency budget that separates a natural-feeling agent from one that feels broken. The platform combines LiveKit Cloud (WebRTC SFU + Python agent worker), a LangGraph state machine for conversation flow, FastMCP for tool exposure, Sarvam for Indian-language speech, Groq for English speech, and OpenAI for language understanding. PostgreSQL persists session memory; Redis holds live state. The architecture follows the latency budget, silence-handling, and conversation-design patterns from the Building Intelligent Voice Agents guide.
Detail-page tour · LiveKit + LangGraph + multi-language voice stack
Core capabilities
Hands-on features available when you launch the live demo.
Sub-1.2s round-trip latency
The end-to-end pipeline is tuned so that the time from the end of the caller's utterance to the first audio byte of the agent's response stays under the 1.2-second budget that separates natural conversation from a broken-feeling agent.
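One way to keep a budget like this visible is to sum per-stage timings for each turn and compare the total against the 1.2-second target. The sketch below is illustrative only: the stage names and millisecond figures are hypothetical, not measurements from this demo.

```python
from dataclasses import dataclass

BUDGET_MS = 1200  # end of utterance to first audio byte, per the target above


@dataclass(frozen=True)
class StageTiming:
    name: str
    ms: int


def check_budget(stages: list[StageTiming], budget_ms: int = BUDGET_MS) -> tuple[int, bool]:
    """Return (total_ms, within_budget) for one conversational turn."""
    total = sum(stage.ms for stage in stages)
    return total, total <= budget_ms


# Hypothetical per-stage numbers for a single turn (assumed, not measured).
turn = [
    StageTiming("endpointing", 200),
    StageTiming("stt_final", 150),
    StageTiming("llm_first_token", 450),
    StageTiming("tts_first_byte", 250),
    StageTiming("network", 100),
]
total_ms, ok = check_budget(turn)
print(f"{total_ms} ms, within budget: {ok}")
```

Tracking the stages separately makes it easier to see which one to shorten when a turn overruns.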
Filler-phrase handler against silence
When the language model is slow, a brief phrase like 'let me check on that for you' fills the gap without breaking flow. The agent never goes silent for more than 1.1 seconds — silence in a phone call is an error signal, not a neutral state.
LangGraph conversation state machine
Multi-turn dialog managed as an explicit DAG with named nodes (greet, gather, confirm, fulfill, recover) instead of free-form prompts — predictable, testable, and easier to evolve as flows grow.
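To show why named nodes are easier to test than free-form prompts, here is a plain-Python transition table over the five nodes named above. It deliberately does not use the LangGraph API; the event names and the specific edges are hypothetical.

```python
from enum import Enum


class Node(str, Enum):
    GREET = "greet"
    GATHER = "gather"
    CONFIRM = "confirm"
    FULFILL = "fulfill"
    RECOVER = "recover"
    END = "end"


# Hypothetical happy-path edges: (node, event) -> next node.
TRANSITIONS = {
    (Node.GREET, "ok"): Node.GATHER,
    (Node.GATHER, "ok"): Node.CONFIRM,
    (Node.CONFIRM, "yes"): Node.FULFILL,
    (Node.FULFILL, "ok"): Node.END,
}


def step(node: Node, event: str) -> Node:
    """Advance one turn. Unexpected events route to recover; recover always ends."""
    if node in (Node.RECOVER, Node.END):
        return Node.END
    return TRANSITIONS.get((node, event), Node.RECOVER)


def run(events: list[str], start: Node = Node.GREET) -> list[Node]:
    """Replay a list of events and return the path of nodes visited."""
    path = [start]
    for event in events:
        path.append(step(path[-1], event))
    return path


print([n.value for n in run(["ok", "ok", "yes", "ok"])])
```

Because every edge is data, a unit test can assert the exact path for a given sequence of caller events.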
FastMCP tool exposure
External tools (lookups, bookings, escalations) are surfaced through FastMCP so the agent can call them mid-conversation with bounded latency and structured arguments.
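The two properties mentioned here, structured arguments and bounded latency, can be illustrated without any MCP library: a typed argument object plus a timeout wrapper around the call. This is a generic sketch rather than FastMCP code; `BookingArgs`, `book_slot`, and the 0.8-second timeout are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class BookingArgs:
    customer_id: str
    slot: str  # ISO 8601 start time


async def book_slot(args: BookingArgs) -> str:
    """Hypothetical tool body; a real one would call a booking backend."""
    await asyncio.sleep(0)
    return f"booked {args.slot} for {args.customer_id}"


async def call_tool(tool, args, timeout_s: float = 0.8) -> dict:
    """Run a tool with a hard deadline so one slow backend cannot stall the turn."""
    try:
        result = await asyncio.wait_for(tool(args), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "timeout"}


print(asyncio.run(call_tool(book_slot, BookingArgs("c1", "2025-01-01T10:00"))))
```

Returning a structured error instead of raising lets the conversation layer decide whether to retry, apologize, or escalate.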
Multi-language speech (Sarvam + Groq)
Sarvam handles Indian-language speech-to-text and text-to-speech (Hindi, Telugu, Tamil, etc.); Groq handles English STT with sub-200 ms first-token latency. Voice routing is transparent to the language model.
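Routing of this kind can be as simple as a lookup on the detected language code. The sketch below assumes BCP 47 style tags such as `hi-IN`, returns provider names as plain strings, and raises on unsupported languages; all three choices are assumptions, not documented behavior of the demo.

```python
# Primary language subtags routed to Sarvam (Hindi, Telugu, Tamil); extend as needed.
INDIC_LANGS = {"hi", "te", "ta"}


def pick_speech_provider(lang_code: str) -> str:
    """Map a language tag like 'hi-IN' or 'en-US' to a speech provider name."""
    primary = lang_code.split("-")[0].lower()
    if primary in INDIC_LANGS:
        return "sarvam"
    if primary == "en":
        return "groq"
    raise ValueError(f"unsupported language: {lang_code!r}")


for tag in ("hi-IN", "en-US", "ta"):
    print(tag, "->", pick_speech_provider(tag))
```

Because the function returns before any text reaches the language model, the model never needs to know which provider produced the transcript.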
OpenAI for language understanding
Conversation reasoning, intent extraction, and tool-argument synthesis run through OpenAI; the model sees only text after STT, so the language layer is decoupled from the audio layer.
Persistent + live session memory
PostgreSQL stores cross-call memory (preferences, history); Redis holds live in-session state (current turn, pending tool calls, partial transcripts) so reconnections after a network blip resume cleanly.
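The resume-after-reconnect behavior can be sketched as a small serialize-and-reload round trip. To stay runnable without a server, the example uses an in-memory stand-in with `get` and `set` methods shaped like a Redis client; the key scheme and field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LiveState:
    current_turn: int = 0
    pending_tool_calls: list[str] = field(default_factory=list)
    partial_transcript: str = ""


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


def session_key(session_id: str) -> str:
    return f"voice:session:{session_id}"  # hypothetical key scheme


def save_state(store, session_id: str, state: LiveState) -> None:
    store.set(session_key(session_id), json.dumps(asdict(state)))


def resume_state(store, session_id: str) -> LiveState:
    """Reload live state after a reconnect, or start fresh if none was saved."""
    raw = store.get(session_key(session_id))
    return LiveState(**json.loads(raw)) if raw else LiveState()


store = InMemoryStore()
save_state(store, "abc", LiveState(3, ["lookup_order"], "I'd like to"))
print(resume_state(store, "abc"))
```

Long-lived data such as preferences and call history would go to PostgreSQL instead; only the short-lived per-call state belongs in this store.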
LiveKit Cloud audio transport
WebRTC SFU handles bidirectional low-latency audio. The Python agent worker auto-joins each room created by the front-end token-issuer; no audio ever transits our application server.
Technology stack
Every layer of the stack, from database to front end.
| Technology | Role & contribution |
|---|---|
| LiveKit Cloud (Agents SDK + SFU) | WebRTC audio transport + Python agent worker that auto-joins rooms |
| LangGraph | Stateful conversation DAG (greet → gather → confirm → fulfill → recover) |
| FastMCP | Tool exposure surface — lookup, booking, escalation calls during a turn |
| Sarvam | Indian-language speech-to-text + text-to-speech (Hindi, Telugu, Tamil, etc.) |
| Groq | English speech-to-text with sub-200ms first-token latency |
| OpenAI | Language understanding, intent extraction, tool-argument synthesis |
| PostgreSQL | Persistent cross-call memory (preferences, history, audit log) |
| Redis | Live in-session state (current turn, pending tool calls, partial transcripts) |
| Next.js 15 + LiveKit React SDK | Front-end UI + server-side LiveKit access-token issuer |
Launch the live Voice Agent demo
Runs in your browser.
Build your own AI experience
Explore the full AIXcelerator platform — agents, skills, MCP servers, and the modular capability layers that power demos like this one.