
Advanced · asked at OpenAI, Deepgram, ElevenLabs, LiveKit, Twilio

Design a Realtime Voice AI Agent

Build a full-duplex voice agent that answers PSTN and WebRTC calls with sub-800ms voice-to-voice latency — covering the cascade pipeline (STT→LLM→TTS), turn detection, barge-in, telephony transport, and scaling to thousands of simultaneous calls.

33 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #agents #realtime #voice

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

When Sierra launched its voice AI for enterprise customer service in 2024, it discovered the same thing every team building voice agents discovers: the LLM is not the bottleneck. The model is fast enough. The bottleneck is the fifteen other decisions between the user finishing a sentence and audio playing out of their earpiece — and almost all of them default to the wrong value.

A voice agent is fundamentally a real-time media system that happens to have a language model in the middle. Unlike a chat interface where 2–3 seconds of response time is acceptable, voice conversation works on human speech rhythm. Once the gap between a user finishing their sentence and hearing a response exceeds roughly 800ms, users perceive the system as broken. At 1.2s they start speaking over the agent. At 1.5s+ the conversation loses coherence entirely. Hamming AI analyzed 4 million production voice calls and found that the industry median is 1.4–1.7s at P50 — well outside the acceptable window — with P95 at 4.3–5.4s.

The gap between achievable (520ms best-case cascade) and typical (1.4s median) is entirely engineering. This article is about closing that gap.

For agent tool-calling and durable orchestration, see Design an AI Agent Platform. This article focuses on the voice-specific layers: the media pipeline, latency budget, turn detection, barge-in, and telephony integration.

Functional requirements

Accept inbound phone calls (PSTN via SIP) and WebRTC connections from browser/app clients.
Transcribe user speech in real time and detect end-of-turn accurately.
Generate a voice response within 800ms of end-of-speech at P50, under 1.2s at P95.
Stop speaking immediately when the user interrupts (barge-in).
Execute tool calls mid-conversation — CRM lookup, calendar booking, database query, RAG retrieval — and incorporate results into the response.
Maintain per-call conversation state across an arbitrary number of turns.
Escalate to a human agent when triggered; pass a structured handoff packet so the human doesn't ask the caller to repeat themselves.
Store transcripts, timing data, and structured events for analytics, QA, compliance, and fine-tuning.

Non-functional requirements

P50 voice-to-voice latency under 800ms; P95 under 1.2s.
5,000+ simultaneous calls with horizontal scaling to 50,000+.
99.9% availability; a crashed media worker must not drop calls silently.
Transcripts and audio comply with applicable regulations (HIPAA for healthcare, GDPR for EU).
Barge-in latency under 60ms from speech detection to audio stop.
Graceful degradation under LLM provider outage: fall back to a different model or hold music with error messaging, not a silent dead line.

Capacity estimation

Dimension	Estimate	How we got there
Simultaneous calls	5,000	Design target; scales to 50,000 with proportional infrastructure
Audio bandwidth per call	~48 Kbps (Opus)	Standard WebRTC/VoIP encoding; G.711 PSTN is 64 Kbps before transcoding
Total audio bandwidth	240 Mbps inbound + 240 Mbps outbound	5,000 × 48 Kbps × 2 directions
LLM calls per call-minute	~2 (avg turn every 30s)	Customer service calls average 1–2 turns per minute
LLM tokens per turn	~800 in / ~150 out	System prompt (500t) + history (200t) + user utterance (100t); short voice responses
STT cost	~$0.0043/min	Deepgram Nova-3 at $0.0043/min (pay-as-you-go)
LLM cost per turn	~$0.006	800 in × $5/M = $0.004 + 150 out × $15/M = $0.00225 ≈ $0.006
TTS cost	~$0.03/min	ElevenLabs Flash at $0.03/min
All-in cost per call-minute	~$0.046–$0.19	STT $0.004 + LLM $0.012 (2 turns/min × $0.006/turn) + TTS $0.03 ≈ $0.046 self-hosted; managed platforms add $0.05–$0.14/min platform fee (Vapi/Retell/Bland), capping the range at ~$0.19/min
Transcript storage	~135 MB/day	5,000 calls × 30 min × 150 words/min × 6 bytes/word = 135 MB
Concurrent sessions per pod	30–40	FastAPI/uvicorn process; event-loop contention above 40 Realtime sessions
Pods required for 5,000 calls	~170	5,000 / 30 calls/pod × 1.02 headroom

Takeaway: At $0.046–$0.19/min all-in, a volume of 5,000 calls/day × 30 min avg × 30 days = 4.5M min/month costs $207,000–$855,000/month (self-hosted base ~$207K in direct API costs; fully managed ~$855K including platform fees). Managed platforms (Vapi at $0.05/min, Retell at $0.055/min) add overhead but eliminate infra work. The break-even for self-hosting via LiveKit + direct APIs vs. managed platforms is around 10,000 minutes/month.
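The per-minute and monthly figures can be reproduced with a few lines of arithmetic. This is a sketch that uses only the prices quoted in the table; the function and constant names are illustrative, not part of any vendor SDK.

```python
# Prices as quoted in the capacity table (USD).
STT_PER_MIN = 0.0043          # Deepgram Nova-3, pay-as-you-go
TTS_PER_MIN = 0.03            # ElevenLabs Flash
LLM_IN_PER_M_TOKENS = 5.0     # input price per million tokens
LLM_OUT_PER_M_TOKENS = 15.0   # output price per million tokens


def llm_cost_per_turn(tokens_in: int = 800, tokens_out: int = 150) -> float:
    """Cost of one LLM turn at the quoted per-token prices."""
    return (tokens_in * LLM_IN_PER_M_TOKENS + tokens_out * LLM_OUT_PER_M_TOKENS) / 1_000_000


def cost_per_call_minute(turns_per_min: float = 2.0, platform_fee: float = 0.0) -> float:
    """STT + LLM + TTS per minute, plus an optional managed-platform fee."""
    return STT_PER_MIN + turns_per_min * llm_cost_per_turn() + TTS_PER_MIN + platform_fee


def monthly_minutes(calls_per_day: int = 5_000, avg_minutes: int = 30, days: int = 30) -> int:
    """Total call minutes per month for the assumed daily volume."""
    return calls_per_day * avg_minutes * days


if __name__ == "__main__":
    base = cost_per_call_minute()
    managed = cost_per_call_minute(platform_fee=0.14)
    minutes = monthly_minutes()
    print(f"per minute: ${base:.4f} to ${managed:.4f}")
    print(f"per month:  ${base * minutes:,.0f} to ${managed * minutes:,.0f}")
```

Computed without intermediate rounding, the base comes out slightly above the table's $0.046, because the table rounds the STT and LLM components down before summing.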

Building up to the design

V1: Naive sequential pipeline

The obvious first design is sequential: record the user's complete utterance, send it to a speech-to-text API, wait for the full transcript, send that to an LLM, wait for the full LLM response, send that to a text-to-speech API, play the audio. This works for a demo.

The latency is:

Silence detection (500ms) + STT batch (400ms) + LLM full response (1,500ms) + TTS batch (300ms) = 2,700ms

That is more than three times the target. No amount of model tuning closes a 3-second gap.

The failure mode is not slowness — it is structure. Every stage waits for the previous one to complete before starting. The solution is not to make each stage faster (though that helps); it is to eliminate the waiting.

V2: Streaming everywhere

The fix is to pipeline every stage. The moment STT produces its first partial transcript word, start feeding it to the LLM. The moment the LLM produces its first sentence, start synthesizing audio. The moment TTS produces its first audio chunk, start streaming it back.

Turn detection (50ms) + STT first word (150ms) → LLM TTFT (200ms) → TTS first chunk (40ms) + network (80ms) ≈ 520ms

The stages still add up along the critical path, but each one starts as soon as the previous stage emits its first output rather than its last. Without streaming you wait for each stage to fully complete before the next begins (~2,700ms total); with streaming you only wait for each stage's time-to-first-output before the next stage starts. The result is ~520ms rather than ~2,700ms.

The critical observation: streaming transforms total latency from the sum of stage completion times (~2,700ms) to the sum of stage first-token times (~520ms). This is not an optimization; it is the architecture.
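As a quick check of the two totals, the sketch below sums the stage timings quoted for V1 and V2. The dictionary keys are illustrative labels; the numbers are copied from the breakdowns above.

```python
# Stage timings in milliseconds, copied from the V1 and V2 breakdowns.
SEQUENTIAL_MS = {
    "silence_detection": 500,
    "stt_batch": 400,
    "llm_full_response": 1500,
    "tts_batch": 300,
}
STREAMING_MS = {
    "turn_detection": 50,
    "stt_first_word": 150,
    "llm_ttft": 200,
    "tts_first_chunk": 40,
    "network": 80,
}
TARGET_MS = 800


def critical_path_ms(stages: dict) -> int:
    """Voice-to-voice latency when the listed waits happen one after another."""
    return sum(stages.values())


def ratio_to_target(stages: dict, target_ms: int = TARGET_MS) -> float:
    """How many times the target the pipeline takes (below 1.0 means within budget)."""
    return critical_path_ms(stages) / target_ms


if __name__ == "__main__":
    for name, stages in (("sequential", SEQUENTIAL_MS), ("streaming", STREAMING_MS)):
        print(name, critical_path_ms(stages), "ms,", f"{ratio_to_target(stages):.2f}x target")
```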

But V2 has two remaining problems. First, you still need to know when the user has finished their sentence (turn detection), and default approaches add 500ms–1,500ms. Second, you have no mechanism for the user to interrupt the agent while it's speaking.

V3: Proper turn detection and barge-in

Turn detection determines when to stop listening and start responding. There are three approaches:

VAD-only: wait for 500–800ms of silence. Simple, but adds that entire silence window to every response. If a user pauses mid-sentence ("I'd like to book a... um... Tuesday appointment"), the agent fires early. Tune the threshold higher to fix pauses and you add more delay.

STT endpointing: let the STT provider signal end-of-turn via events like Deepgram's speech_final=true or utterance_end. The provider combines acoustic silence with punctuation prediction. This is the recommended default. Deepgram Nova-3 achieves this in 150–250ms after the last spoken word.

Model-based semantic detection: Deepgram Flux is a single model that fuses transcription and turn detection, firing an EagerEndOfTurn event 150–250ms earlier than pipeline approaches by predicting conversational completion even mid-sentence. LiveKit ships a similar semantic turn classifier. The cost: 50–70% more LLM calls triggered (because you start the LLM earlier, sometimes before the user has truly finished), which requires speculative execution and occasional discard.
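The trade-off in the VAD-only approach is easy to see in a small simulation. The sketch below assumes some VAD has already labeled each 20ms frame as speech or silence; `end_of_turn_ms` and `utterance` are hypothetical helpers written for this illustration.

```python
from typing import Iterable, List, Optional, Tuple


def end_of_turn_ms(frames: Iterable[bool], frame_ms: int = 20, silence_ms: int = 500) -> Optional[int]:
    """Return the time (ms from stream start) at which end-of-turn fires, or None.

    End-of-turn fires once speech has been heard and `silence_ms` of
    consecutive silence follows it.
    """
    heard_speech = False
    silent_run_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_run_ms = 0
        elif heard_speech:
            silent_run_ms += frame_ms
            if silent_run_ms >= silence_ms:
                return (i + 1) * frame_ms
    return None


def utterance(*segments: Tuple[bool, int], frame_ms: int = 20) -> List[bool]:
    """Build a frame sequence from (is_speech, duration_ms) segments."""
    frames: List[bool] = []
    for is_speech, duration_ms in segments:
        frames.extend([is_speech] * (duration_ms // frame_ms))
    return frames


# 1.0s of speech, a 600ms mid-sentence pause, 1.0s more speech, then silence.
# The user actually finishes speaking at 2,600ms.
FRAMES = utterance((True, 1000), (False, 600), (True, 1000), (False, 2000))

if __name__ == "__main__":
    print(end_of_turn_ms(FRAMES, silence_ms=500))  # fires inside the pause
    print(end_of_turn_ms(FRAMES, silence_ms=800))  # survives the pause, but waits longer
```

With a 500ms threshold the detector fires at 1,500ms, in the middle of the user's sentence. Raising it to 800ms avoids the false trigger but fires at 3,400ms, a full 800ms after the user stopped.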

For barge-in: run VAD on the inbound audio track at all times, even while TTS is playing. When VAD detects user speech: (1) cancel the pending LLM response, (2) send close or flush to the TTS WebSocket within 60ms, (3) drop (not drain) queued audio in the jitter buffer. The 60ms deadline is hard — beyond that, users perceive the agent as unresponsive to the interruption.

V4: Full-duplex with telephony

The final architectural step is handling real telephony. PSTN calls arrive via SIP trunks from Twilio, Telnyx, or Vonage as G.711 μ-law audio at 8kHz. Modern STT models are trained on 16kHz and produce measurably higher word error rates on 8kHz input. You must transcode exactly once at the media gateway to PCM 16kHz or Opus 16kHz. Never pass raw G.711 to a production STT model.

The media path for PSTN: SIP trunk → media gateway (FreeSWITCH / Asterisk / Twilio ConversationRelay) → transcode to PCM 16kHz → jitter buffer → echo cancellation → your agent pipeline. For WebRTC: browser → SFU (LiveKit) → Opus decode → same agent pipeline from jitter buffer onward.
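To make the transcode step concrete, here is a minimal pure-Python sketch of G.711 μ-law decoding followed by 2x upsampling to 16kHz. The decoder follows the standard μ-law expansion. The upsampler uses naive linear interpolation purely for illustration; a production gateway such as FreeSWITCH would use a proper resampling filter. All function names here are hypothetical.

```python
import struct
from typing import List


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                      # mu-law bytes are transmitted inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 possible byte values once.
_ULAW_TABLE = [ulaw_to_linear(b) for b in range(256)]


def upsample_2x(samples: List[int]) -> List[int]:
    """Naive 8 kHz to 16 kHz upsampling by linear interpolation (illustration only)."""
    out: List[int] = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def g711_ulaw_to_pcm16k(payload: bytes) -> bytes:
    """Transcode a G.711 mu-law payload to little-endian 16-bit PCM at 16 kHz."""
    pcm_8k = [_ULAW_TABLE[b] for b in payload]
    pcm_16k = upsample_2x(pcm_8k)
    return struct.pack(f"<{len(pcm_16k)}h", *pcm_16k)
```

A 20ms RTP payload of 160 μ-law bytes becomes 320 samples, or 640 bytes, of 16kHz PCM.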

flowchart TD
    A["PSTN caller<br/>G.711 μ-law 8kHz"] --> B["SIP trunk<br/>Twilio / Telnyx"]
    B --> C["Media Gateway<br/>transcode once to PCM 16kHz"]
    D["WebRTC caller<br/>Opus 48kHz"] --> E["SFU<br/>LiveKit"]
    E --> F["Opus → PCM 16kHz decode"]
    C --> G["Jitter buffer<br/>20–60ms"]
    F --> G
    G --> H["AEC<br/>echo cancellation"]
    H --> I["VAD<br/>Silero — 10–30ms frames"]
    I --> J["Streaming STT<br/>Deepgram Nova-3<br/>TTFT 150ms"]
    J --> K["Turn Detector<br/>speech_final / Flux EOT"]
    K --> L["LLM<br/>Groq / GPT-4o<br/>TTFT 160–500ms"]
    L --> M["TTS<br/>Cartesia Sonic<br/>TTFA 40ms"]
    M --> N["Audio out<br/>SFU / Media Gateway"]
    style C fill:#ff6b1a,color:#0a0a0f
    style J fill:#0e7490,color:#fff
    style K fill:#a855f7,color:#fff
    style L fill:#ffaa00,color:#0a0a0f
    style M fill:#0e7490,color:#fff
    style H fill:#15803d,color:#fff
    style I fill:#15803d,color:#fff

API

The orchestrator exposes a REST API for provisioning voice agents, and the media pipeline communicates via WebSockets internally.

POST /v1/agents
{
  "system_prompt": "You are a scheduling assistant for Acme Dental...",
  "voice_id": "cartesia:sonic-3-turbo:en-female-calm",
  "model": "gpt-4o",
  "tools": ["calendar_lookup", "appointment_book", "crm_get_patient"],
  "turn_detection": { "mode": "semantic", "threshold": 0.7, "silence_ms": 600 },
  "barge_in": true,
  "max_turns": 30,
  "escalation_triggers": ["speak to a human", "not understanding", "frustrated"]
}
→ { "agent_id": "agt_7xKp..." }

POST /v1/calls
{
  "agent_id": "agt_7xKp...",
  "transport": "twilio",
  "call_sid": "CA8f3...",
  "caller_number": "+14155551234",
  "metadata": { "patient_id": "P-4892" }
}
→ { "call_id": "call_9aRz...", "websocket_url": "wss://media.example.com/call_9aRz" }

GET /v1/calls/{call_id}/transcript
→ {
    "turns": [
      { "speaker": "user", "text": "Hi, I need to reschedule my appointment", "start_ms": 0, "end_ms": 2340, "confidence": 0.97 },
      { "speaker": "agent", "text": "Of course! What's your name?", "start_ms": 2791, "ttfa_ms": 451, "tts_model": "cartesia-sonic-3" }
    ],
    "tool_calls": [
      { "name": "crm_get_patient", "args": {"name": "Jane Smith"}, "duration_ms": 87, "status": "success" }
    ],
    "summary": "...",
    "sentiment_final": "satisfied"
  }

POST /v1/calls/{call_id}/transfer
{
  "method": "sip_refer",
  "destination": "+18005557890",
  "handoff_packet": { "summary": "...", "intent": "reschedule", "entities": {...} }
}

The schema

Each call session is tracked as a document combining telephony metadata, live state, and post-call analytics:

{
  "call_id": "call_9aRz...",
  "agent_id": "agt_7xKp...",
  "call_sid": "CA8f3...",
  "caller": "+14155551234",
  "status": "active",
  "started_at": "2026-06-25T14:32:11Z",
  "transport": "twilio_pstn",
  "turns": [
    {
      "index": 0,
      "speaker": "user",
      "text": "Hi, I need to reschedule my appointment",
      "audio_start_ms": 1240,
      "speech_final_ms": 3580,
      "ttfa_ms": 451,
      "confidence": 0.97
    }
  ],
  "tool_calls": [],
  "context_turns_in_window": 4,
  "session_rotations": 0,
  "metrics": {
    "p50_ttfa_ms": 438,
    "p95_ttfa_ms": 891,
    "barge_in_count": 1
  },
  "escalation": null,
  "ended_at": null,
  "recording_url": "s3://voice-recordings-us/2026/06/25/call_9aRz.opus"
}

The hot path writes turn data to Redis for sub-millisecond read access by the session manager. Postgres stores the durable session record. Raw audio goes to S3 with a 90-day retention lifecycle before transitioning to Glacier.

Architecture

Production system

flowchart LR
    subgraph EDGE["Media Edge - per region"]
        MGWY["Media Gateway<br/>FreeSWITCH / LiveKit"]
        AEC2["AEC + Jitter"]
    end
    subgraph WORKERS["Agent Workers - K8s"]
        W1["Pod 1<br/>30–40 calls"]
        W2["Pod 2<br/>30–40 calls"]
        WN["Pod N<br/>30–40 calls"]
    end
    subgraph AI["AI Services"]
        STT2["Deepgram Nova-3<br/>streaming WebSocket"]
        LLM2["Groq / GPT-4o<br/>streaming"]
        TTS2["Cartesia Sonic<br/>streaming WebSocket"]
        TOOLS2["Tool Gateway<br/>CRM / Cal / RAG"]
    end
    subgraph STATE2["State"]
        REDIS2[("Redis<br/>active sessions")]
        PG[("Postgres<br/>call records")]
        S3["S3<br/>recordings"]
    end
    subgraph OBS2["Observability"]
        KAFKA2["Kafka"]
        ES2["Elasticsearch<br/>hot 30d"]
        CK2["ClickHouse<br/>analytics"]
        PII["PII Redactor"]
    end
    MGWY --> AEC2 --> W1 & W2 & WN
    W1 & W2 & WN --> STT2
    W1 & W2 & WN --> LLM2
    LLM2 --> TOOLS2 --> LLM2
    W1 & W2 & WN --> TTS2
    TTS2 --> MGWY
    W1 & W2 & WN --> REDIS2
    W1 & W2 & WN --> PG
    W1 & W2 & WN --> S3
    W1 & W2 & WN --> KAFKA2
    KAFKA2 --> PII --> ES2 & CK2
    style MGWY fill:#ff6b1a,color:#0a0a0f
    style STT2 fill:#0e7490,color:#fff
    style TTS2 fill:#0e7490,color:#fff
    style LLM2 fill:#ffaa00,color:#0a0a0f
    style TOOLS2 fill:#ff6b1a,color:#0a0a0f
    style REDIS2 fill:#ff2e88,color:#fff
    style PII fill:#a855f7,color:#fff

Each Kubernetes pod runs a FastAPI/uvicorn process handling 30–40 concurrent calls. This limit is not CPU; it is Python event-loop contention from maintaining 90+ simultaneous WebSocket connections (three per call: one each to STT, TTS, and the media gateway). Scale is driven by the active_calls Prometheus metric published per pod, not CPU utilization. The horizontal pod autoscaler watches this metric with a target of 25 active calls per pod to maintain headroom for bursts.

Sticky routing via sessionAffinity: ClientIP with a 3,600-second timeout ensures the same pod handles a call from start to finish. This is critical because per-call state (the ChatContext, STT WebSocket, TTS WebSocket) lives in the pod's process memory. On SIGTERM, pods stop accepting new calls and drain existing ones gracefully with a 30-second window.

Hot-path turn sequence

sequenceDiagram
    participant U as Caller
    participant MG as Media Gateway
    participant POD as Agent Pod
    participant STT as Deepgram
    participant LLM as GPT-4o
    participant TTS as Cartesia

    U->>MG: I need to cancel my appointment
    MG->>POD: PCM 16kHz audio stream continuous
    POD->>STT: audio frames 10ms chunks
    STT-->>POD: interim transcript is_final=false t=+150ms
    STT-->>POD: speech_final=true t=+280ms
    Note over POD: End-of-turn detected t=+280ms
    POD->>LLM: stream=true max_tokens=150 messages=history
    Note over POD,LLM: TTFT target 200ms Groq / 500ms GPT-4o
    LLM-->>POD: Of course I can help — first tokens t=+480ms
    POD->>TTS: send text chunk Of course I can help
    TTS-->>POD: first audio chunk t=+520ms
    POD->>MG: audio stream
    MG->>U: plays audio t=+600ms approx 320ms TTFA from speech_final
    LLM-->>POD: with that — more tokens t=+550ms
    POD->>TTS: send text chunk with that
    LLM-->>POD: function_call crm_get_patient t=+620ms
    Note over POD: Pause TTS. Execute tool.
    POD->>TTS: send Let me pull up your account
    POD->>+LLM: tool_result patient_id P-4892 next_appt July 3rd
    LLM-->>-POD: I see your appointment on July 3rd t=+920ms
    POD->>TTS: send final response
    TTS-->>POD: audio
    POD->>MG: audio
    MG->>U: plays final response

The sequence shows the key latency landmarks: end-of-speech at +280ms, LLM first tokens at +480ms, first audio playing at +600ms from the start of the user's utterance. That 320ms from speech_final to first audio is the number to optimize, not the wall-clock time from the beginning of the utterance.

Deep dives

Latency budget: the 800ms breakdown

The 800ms target is not arbitrary — it matches human conversational pause expectations. Here is how to allocate it:

Stage	Budget	Achievable best	Notes
Turn detection	0ms (overlapped)	50–280ms	Run in parallel with speech; use STT endpointing
STT first word latency	150ms	90ms (AssemblyAI)	From audio start, overlapped with VAD
STT end-of-turn signal	included above	Deepgram: 150–250ms after last word	`speech_final` after last word
LLM time-to-first-token	200ms	160ms (Groq Llama) / 300ms (Gemini Flash) / 500ms (GPT-4o)	Budget for model choice
TTS time-to-first-audio	~40ms	40ms (Cartesia Sonic Turbo) / 75ms (ElevenLabs Flash)	From first text to first audio chunk
Network / jitter	~80ms	30ms (same-region) / 100ms (cross-region US)	Colocate media edge with AI services
Total (WebRTC)	~520ms	460–520ms achievable	Within budget
PSTN overhead	+200–400ms	+150ms (Twilio backbone)	SIP adds legs; Twilio ConversationRelay cuts this to 150ms

Twilio measured their own ConversationRelay achieving under 500ms median and under 725ms at P95. Among the stages in this table, the biggest single variable is model choice: switching from GPT-4o (500ms TTFT) to Groq-hosted Llama (160ms TTFT) saves 340ms, more than any other single-stage optimization.

Turn detection in depth

Turn detection is the most misunderstood component in voice AI, and also the highest-impact one. A poorly tuned turn detector adds 1.5–4s to every response before the pipeline even starts. Getting this right matters more than model selection.

The three modes compared:

VAD silence-based detection classifies each 10–30ms audio frame as speech or silence, then fires end-of-turn after a configurable silence gap. Silero VAD achieves 87.7% true-positive rate at 5% false-positive rate versus WebRTC VAD's 50% TPR — a meaningful difference in production. The fundamental problem is the silence threshold: at 500ms, mid-sentence pauses trigger early; at 1,500ms (the common default), you add 1.5s to every response.

STT endpointing uses the provider's own end-of-turn model. Deepgram Nova-3 emits speech_final=true when its acoustic model determines the utterance is complete, typically 150–250ms after the last word. This is significantly more accurate than raw VAD silence because it accounts for sentence structure and typical speech patterns. Set utterance_end_ms to 1,000ms as a backstop for genuinely ambiguous cases.

Model-based semantic detection (Deepgram Flux, LiveKit semantic VAD) trains a transformer classifier on conversational end-of-turn patterns. Flux fires an EagerEndOfTurn event 150–250ms before a sentence is fully complete by predicting whether the speaker will continue. In practice, this means the LLM starts processing while the user is still saying their final word. Deepgram reports a 30% reduction in false interruptions versus pipeline approaches. The tradeoff: you trigger LLM calls for speculative responses that get discarded when the user continues — about 60% of speculative responses are discarded, 40% commit. The net saving for those 40% is ~500ms per committed response.

For most production deployments: use STT endpointing (speech_final) as the baseline, tune utterance_end_ms down from the provider default (often 2,000ms) to 800–1,000ms, and only add model-based detection after the rest of the pipeline is stable.
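A sketch of the recommended baseline: assemble a turn from finalized transcript segments and close it on `speech_final`, with the utterance-end message as a backstop. The message shapes below (`type: "Results"` with `is_final`, `speech_final`, and `channel.alternatives[0].transcript`, plus a separate `type: "UtteranceEnd"` message) are an assumption about Deepgram's streaming responses and should be checked against the current API reference. `TurnAssembler` is a hypothetical helper.

```python
import json
from typing import List, Optional


class TurnAssembler:
    """Accumulates finalized transcript segments and emits one string per turn."""

    def __init__(self) -> None:
        self._segments: List[str] = []

    def feed(self, raw_message: str) -> Optional[str]:
        """Return the full user turn when end-of-turn is signalled, else None."""
        msg = json.loads(raw_message)
        kind = msg.get("type")

        if kind == "Results":
            alternatives = msg.get("channel", {}).get("alternatives", [])
            text = alternatives[0].get("transcript", "") if alternatives else ""
            # Interim results (is_final=false) may still be revised, so skip them.
            if msg.get("is_final") and text:
                self._segments.append(text)
            if msg.get("speech_final"):
                return self._flush()
        elif kind == "UtteranceEnd":
            # Backstop: a long gap elapsed without a speech_final signal.
            return self._flush()
        return None

    def _flush(self) -> Optional[str]:
        if not self._segments:
            return None
        turn = " ".join(self._segments)
        self._segments.clear()
        return turn
```

The orchestrator would call `feed` on every STT WebSocket message and start the LLM request as soon as it returns a non-None turn.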

Barge-in and interruption handling

Barge-in is full-duplex: the agent must process incoming user audio continuously, even while playing TTS output. This requires the audio stack to run two independent streams — outbound TTS and inbound user audio — simultaneously, without the outbound audio feeding back into the inbound stream. Acoustic echo cancellation (AEC) makes this possible.

AEC must run at the client (WebRTC has it built-in) or at the media gateway (PSTN deployments require explicit configuration). If AEC is skipped, the agent hears its own TTS output via the microphone, VAD fires, STT transcribes the agent's own words, and the LLM generates a response to itself. This is a catastrophic failure mode that looks like an infinite loop.

When the VAD detects user speech while TTS is playing:

Send response.cancel to the LLM if using OpenAI Realtime API, or abort the streaming connection otherwise.
Close the TTS WebSocket (or call its cancellation endpoint). Cartesia Sonic and ElevenLabs Flash both support mid-stream cancellation via WebSocket close.
Drop all queued audio in the jitter buffer immediately. Do not drain — drain plays out the buffered audio first, meaning the user hears the tail of the response they just interrupted.
Restart the STT context for the new utterance.

The 60ms deadline from speech detection to audio stop is empirically derived: above 60ms, users perceive the agent as slow to respond to the interruption. This is a tighter real-time constraint than the 800ms response target.
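The first three steps can be sketched with asyncio primitives. `PlaybackSession` is a hypothetical per-call structure, and `close_tts` is a placeholder for closing the real TTS WebSocket. The point of the sketch is the ordering and the drop-not-drain handling of the queue, not a production implementation.

```python
import asyncio
from typing import Optional


class PlaybackSession:
    """Holds the in-flight response for one call (hypothetical structure)."""

    def __init__(self) -> None:
        self.audio_queue = asyncio.Queue()          # audio chunks awaiting playback
        self.llm_task: Optional[asyncio.Task] = None
        self.tts_open = True

    async def close_tts(self) -> None:
        # Placeholder: a real implementation closes the TTS WebSocket here.
        self.tts_open = False

    async def on_barge_in(self) -> int:
        """Stop the agent's speech. Returns the number of audio chunks dropped."""
        # 1. Cancel the streaming LLM response.
        if self.llm_task is not None and not self.llm_task.done():
            self.llm_task.cancel()
        # 2. Close the TTS stream so no further audio is synthesized.
        await self.close_tts()
        # 3. Drop (do not drain) everything still queued for playback.
        dropped = 0
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()
            dropped += 1
        return dropped
```

Restarting the STT context (step four) is provider-specific and is left out here.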

Two tricky barge-in cases:

Filler sounds: "uh-huh", "mm", "yeah" during the agent's speech are social backchannels, not interruptions. Firing barge-in on every filler sound destroys conversation flow. The mitigation has two parts: apply a confidence threshold to the VAD output (treat frames below 0.6 confidence as backchannels), and require at least 200ms of sustained speech before triggering barge-in.

Mid-tool-call barge-in: the user speaks while the agent is silently waiting for a tool call to return. There is no active TTS to cancel. The orchestrator must still reset context and decide: abandon the tool call (simpler but loses the result), or surface the tool result while incorporating what the user just said (more complex, usually better). Most implementations choose to finish the tool call if it completes within 500ms and surface the result.
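For the filler-sound case, the two-part mitigation (confidence threshold plus sustained-speech requirement) fits in a small gate that sits between the VAD and the barge-in handler. `BargeInGate` is a hypothetical name; the 0.6 and 200ms defaults come from the text, and the 20ms frame size is an assumption.

```python
class BargeInGate:
    """Fires only after sustained, confident speech."""

    def __init__(self, min_confidence: float = 0.6, min_speech_ms: int = 200, frame_ms: int = 20) -> None:
        self.min_confidence = min_confidence
        self.min_speech_ms = min_speech_ms
        self.frame_ms = frame_ms
        self._run_ms = 0  # length of the current run of confident speech frames

    def push(self, vad_confidence: float) -> bool:
        """Feed one frame's VAD speech probability; True means trigger barge-in."""
        if vad_confidence >= self.min_confidence:
            self._run_ms += self.frame_ms
        else:
            self._run_ms = 0  # any low-confidence frame resets the run
        return self._run_ms >= self.min_speech_ms
```

A short "mm" of 100ms never reaches the 200ms run length, so playback continues. Genuine speech triggers on the tenth consecutive confident frame.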

Function calling and tool integration

Tool calls in a voice conversation create audible dead time. Every tool call that takes more than 300ms will be perceived as the agent "thinking" — acceptable once, awkward twice, alarming three times.

The latency impact is substantial. Twilio measured that each tool call adds 400–800ms to a response turn even when the tool itself responds in under 200ms. The overhead comes from the LLM emitting the function call JSON mid-stream, the orchestrator parsing it, dispatching to the tool, waiting for the result, formatting it back into the LLM context, and resuming generation. GPT-4o function calling accuracy on multi-turn voice benchmarks is only 50% (Daily.co, June 2025); the gpt-realtime model achieves 66.5% on ComplexFuncBench — meaningful progress but still not reliable enough to skip validation layers.

Patterns for managing tool call latency:

For tools under 200ms (typical CRM read, Redis cache hit): block synchronously. The 400–800ms total overhead is perceptible but users accept it as a natural mid-turn pause — note that this overhead alone approaches the 800ms P50 target, so the full turn will exceed it; keep fast tool calls genuinely fast.

For tools over 200ms (external API, RAG retrieval, database write): use the async pattern. RAG retrieval — chunking, vector index lookup, and reranking — is itself a non-trivial pipeline; see Design a RAG Pipeline for the full stack. Immediately inject a filler response ("Let me check on that for you") as pre-synthesized audio cached at startup — zero TTS latency. Submit the tool call as a background task. When it completes, inject the result into the conversation context and generate the real response. The user hears the filler, then hears the answer a moment later without silence.

For parallel tool calls: Sierra's architecture runs concurrent LLM inference branches for independent operations, so abuse detection, intent classification, and tool execution run simultaneously rather than serially. This reduces the latency of those operations from their sum to roughly that of the slowest one.

Cache frequently called tools. If your agent answers "what are your business hours?" dozens of times per day, the CRM lookup returning the same static data should be Redis-cached with a 1-hour TTL, not a live API call each time. Pre-generate and cache TTS audio for common phrases ("Let me check on that", "Your appointment is confirmed", "Thank you for calling") so they play at zero TTS latency.
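The filler pattern can be sketched with asyncio. This is a variant of the approach described above: instead of classifying tools as fast or slow ahead of time, it starts the tool immediately and plays the filler only if the tool has not returned within a patience window (200ms here, matching the threshold in the text). `run_tool_with_filler` and its parameters are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_tool_with_filler(
    tool: Callable[[], Awaitable[T]],
    play_filler: Callable[[], Awaitable[None]],
    patience_s: float = 0.2,
) -> T:
    """Run `tool`; if it has not returned within `patience_s`, play a filler phrase."""
    task = asyncio.ensure_future(tool())
    done, _pending = await asyncio.wait({task}, timeout=patience_s)
    if not done:
        # The tool is slow: cover the silence with pre-synthesized audio.
        await play_filler()
    # The tool kept running in the background while the filler played.
    return await task
```

Because the tool task is created before the wait, a fast tool returns with no filler at all, and a slow one overlaps with the filler audio rather than starting after it.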

Session state and context management

Each call maintains a ChatContext object in the agent pod's memory: the full message history, tool call results, current system prompt, and telephony state. This is the conversational working memory.

The context management problem: unbounded growth degrades latency measurably. OpenAI Realtime API's median turn latency grows from approximately 800ms at turn 1 to over 2,000ms after 20 turns as the context window expands. The model spends more time processing more tokens, and billing per audio token means costs compound.

Three mitigation strategies:

Session rotation: every 8–12 turns, open a fresh API session, rehydrate it with a compressed summary of the conversation so far (generated by a quick LLM call), and continue. The compression call takes 300–500ms but amortizes over the next 8–12 turns and pays off in lower per-turn latency. LiveKit's agent framework and Pipecat both implement session rotation as first-class features.

Selective pruning: use OpenAI Realtime API's conversation.item.delete to remove old turns from the context. Keep the system prompt, the last 4–6 turns of actual dialogue, and any injected tool results that are still active. Prune earlier turns whose entities have already been resolved.

State-machine workflow: decompose the conversation into explicit states (greeting → intent capture → lookup → confirm → close) with a focused system prompt for each state and a fresh context per state transition. This bounds context growth by design. It works well for structured tasks (appointment booking, order lookup) but less well for open-ended conversations.
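A minimal sketch of the first two strategies, assuming the context is a list of chat-style message dicts with a `role` key. Both helpers are hypothetical. The pruning here is purely positional and does not implement the "keep tool results that are still active" rule, which needs application knowledge.

```python
from typing import Dict, List


def prune_context(messages: List[Dict], keep_last: int = 6) -> List[Dict]:
    """Keep system messages plus the most recent `keep_last` dialogue messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if keep_last <= 0:
        return system
    return system + dialogue[-keep_last:]


def should_rotate(turn_index: int, every_n_turns: int = 10) -> bool:
    """True when a session rotation is due (the text suggests every 8 to 12 turns)."""
    return turn_index > 0 and turn_index % every_n_turns == 0
```

On a rotation, the orchestrator would summarize the pruned-away turns with a quick LLM call and prepend that summary to the fresh session.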

For cross-session state — caller's name, account details, prior history — store in Redis keyed by phone number or account ID and inject a compact summary into the system prompt at call start. Do not bloat the context with multi-call history; summarize it.

Telephony and transport

PSTN via SIP. Twilio, Telnyx, and Vonage provide SIP trunks that deliver G.711 μ-law audio at 8kHz. The recommended integration point is Twilio ConversationRelay (2025), which handles STT+TTS+barge-in+session management at the TwiML layer — Twilio reports under 500ms median, under 725ms P95. For maximum control, Twilio Media Streams bridges raw audio to your WebSocket server and you manage the entire pipeline.

FreeSWITCH and Asterisk are open-source alternatives for SIP→AI pipelines; FreeSWITCH's ESL (Event Socket Layer) gives programmatic call control from Python or Go, handles codec transcoding, DTMF, and conference bridging.

WebRTC. LiveKit (open-source Go SFU) is the standard for WebRTC voice agents. The agent joins a LiveKit Room as a headless participant; the SFU handles DTLS/SRTP, ICE trickle, NAT traversal, adaptive bitrate, and packet loss concealment. LiveKit Agents v1.5 (2025) ships native SIP and phone numbers, eliminating the need for a separate Twilio bridge for many deployments. Pipecat (from Daily.co) recommends WebRTC over WebSocket for production audio because WebRTC uses UDP with congestion control, whereas WebSocket over TCP causes head-of-line blocking under packet loss — audible as stuttering.

Codec chain. PSTN delivers G.711 μ-law at 8kHz. WebRTC typically delivers Opus at 16–48kHz. STT models expect PCM 16kHz or Opus 16kHz. Transcode exactly once at the media gateway — never forward untranscoded G.711 to a modern STT model. The codec quality difference is measurable: wideband G.722 (64kbps, 16kHz) or Opus 16kHz produces materially lower STT word error rates than G.711 8kHz.

Jitter buffer. SIP over public internet introduces packet reorder and loss. A 20–60ms jitter buffer absorbs this without audible artifacts. WebRTC stacks include a jitter buffer; SIP pipelines need explicit configuration (RTPengine, FreeSWITCH's jitter module). Colocate the jitter buffer and AI inference in the same region: every millisecond of one-way delay between them costs two milliseconds per turn, once for the audio going in and once for the audio coming back.

Scaling to thousands of concurrent calls

One thousand concurrent calls requires approximately 34 Kubernetes pods (1,000 / 30 calls/pod = 33.3, round up). Five thousand calls requires ~170 pods (ceil(5,000/30) = 167 × 1.02 headroom). Ten thousand is ~341 pods (10,000 / 30 = 334, × 1.02 headroom). The mathematics are simple; the operational details are not.
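The pod counts above follow from one formula. The sketch below reproduces them; `pods_required` is an illustrative helper, and rounding the headroom-adjusted value to the nearest pod is an assumption chosen to match the quoted figures.

```python
import math


def pods_required(concurrent_calls: int, calls_per_pod: int = 30, headroom: float = 1.02) -> int:
    """Pods needed: round up to whole pods, then apply the headroom multiplier."""
    base = math.ceil(concurrent_calls / calls_per_pod)
    return round(base * headroom)


if __name__ == "__main__":
    print(pods_required(1_000, headroom=1.0))  # the 1,000-call figure omits headroom
    print(pods_required(5_000))
    print(pods_required(10_000))
```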

CPU is the wrong scaling signal. Voice agent pods spend most of their time in I/O wait — streaming to/from WebSockets for STT, LLM, and TTS. CPU utilization reads low (10–20%) while the event loop is actually at capacity. Scale on active_calls, a Prometheus gauge published by each pod, not on CPU. Set the HPA target at 25 calls/pod to leave headroom for call-start spikes.
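For intuition on how the autoscaler reacts to this gauge, the sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric). It ignores the HPA's tolerance band and stabilization window, and the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, avg_active_calls: float, target_per_pod: float = 25) -> int:
    """Replica count the HPA would converge on for an average per-pod metric."""
    return math.ceil(current_replicas * avg_active_calls / target_per_pod)


if __name__ == "__main__":
    # 100 pods averaging 30 calls each, against a target of 25 per pod.
    print(hpa_desired_replicas(100, 30))
```

With 100 pods each carrying 30 calls, the autoscaler asks for 120 pods, which brings the average back down to the 25-call target.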

Stateful sticky routing. Each call's state — ChatContext, three WebSocket connections, VAD state machine — lives in the agent pod's process memory. Calls must be routed to the same pod throughout their lifecycle. Use Kubernetes sessionAffinity: ClientIP with a 3,600-second timeout. If the pod dies mid-call, the call is lost (the phone call drops or goes to hold music); persist the conversation history to Redis on every turn so a replacement pod can resume from a summary.

Connection pool management. Creating a new HTTP/WebSocket connection to Deepgram, Cartesia, or OpenAI per call adds 20–100ms of TLS handshake overhead. Maintain a pool of pre-established WebSocket connections per pod and check them out at call start. This is especially important for TTS where time-to-first-byte is measured from the HTTP request.

Concurrency limits as backpressure. OpenAI Realtime API has per-organization concurrency limits. Track active sessions against the quota in Redis and reject (with a graceful message) rather than queue. A queued call that picks up after 30 seconds of silence is worse than a call that gets a "high volume" message and an offer to be called back.
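A minimal sketch of reject-rather-than-queue admission. The class below is an in-process stand-in: in a multi-pod deployment the counter would live in Redis and be updated atomically (for example with INCR and DECR), which is not shown here. `AdmissionController` is a hypothetical name.

```python
class AdmissionController:
    """In-process stand-in for a shared counter of active provider sessions."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self.active = 0

    def try_admit(self) -> bool:
        """Admit a call if quota remains; otherwise reject immediately (never queue)."""
        if self.active >= self.max_sessions:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Call on hang-up so the slot becomes available again."""
        self.active = max(0, self.active - 1)
```

When `try_admit` returns False, the media gateway plays the "high volume" message and offers a callback instead of holding the caller in silence.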

Multi-region deployment. Deploy media edge nodes and AI inference workers in the same region. Cross-US adds 60–80ms; cross-continental adds 200–300ms. An architecture with the media gateway in US-East and the AI inference in US-West adds 120–160ms to every interaction with no benefit. Minimum viable multi-region: US-East + US-West + EU-West, with anycast DNS routing callers to the nearest PoP.

Observability and post-call analytics

See Design an LLM Observability Platform for the full observability architecture. For voice agents specifically, the signal hierarchy is:

Layer 1 — telephony/audio quality: DTMF detection, packet loss rate, jitter, MOS (Mean Opinion Score) if available from your SFU, call duration, disconnect reason.

Layer 2 — ASR/transcription: word confidence distribution, utterance silence ratio, STT error events, WER (measured via QA sample against human transcripts).

Layer 3 — LLM/semantic: TTFT per turn, tool call success/failure rate, intent classification accuracy, context window utilization, session rotation events.

Layer 4 — TTS: TTFA per turn, TTS cancellations (barge-in count), audio delivery errors.

The primary latency metric is TTFA (Time to First Audio), measured from the speech_final event to the first audio byte delivered to the caller. Track P50/P90/P95 separately per pipeline stage. Per-turn TTFA exposes the most common failure: a slow LLM turn is invisible if you only track total call duration.
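Computing TTFA percentiles from turn records is straightforward. The sketch below uses the nearest-rank method. The `first_audio_ms` field is a hypothetical name for the timestamp of the first delivered audio byte, and `speech_final_ms` matches the field in the session schema.

```python
import math
from typing import Dict, List


def ttfa_ms(turn: Dict) -> int:
    """TTFA for one turn: first audio delivered minus the speech_final timestamp."""
    return turn["first_audio_ms"] - turn["speech_final_ms"]


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def ttfa_summary(turns: List[Dict]) -> Dict[str, float]:
    """P50/P90/P95 TTFA across a set of turns."""
    samples = [ttfa_ms(t) for t in turns]
    return {f"p{p}": percentile(samples, p) for p in (50, 90, 95)}
```

Running the same summary over per-stage timestamps (STT final, LLM first token, TTS first chunk) shows which stage is responsible when the P95 drifts.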

Route all events through Kafka or Kinesis before storage. This is the PII enforcement boundary — implement redaction (strip names, phone numbers, medical terms) at the stream processor before writing to Elasticsearch or ClickHouse. Raw audio goes to S3 with a 90-day retention policy; scrubbed transcripts go to the warm analytics tier; structured events (tool calls, latency metrics, intent labels) go to the 7-year compliance archive.
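A minimal sketch of the redaction step, assuming it runs as a pure function inside the stream processor. The phone pattern and the two-word medical term list are placeholders; stripping names reliably needs an entity-recognition model rather than regular expressions, which this sketch does not attempt.

```python
import re

# Loose phone pattern: optional "+", then digits with common separators.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical term list; a real deployment would load a maintained vocabulary.
TERMS = re.compile(r"\b(diabetes|insulin)\b", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace phone numbers and listed medical terms before storage."""
    text = PHONE.sub("[PHONE]", text)
    return TERMS.sub("[MEDICAL]", text)


print(redact("Call me at 415-555-0132 about my insulin refill"))
# prints: Call me at [PHONE] about my [MEDICAL] refill
```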

Post-call processing (after hang-up): run heavier summarization models (no latency constraint), sentiment classification, topic tagging, QA scoring. Flag calls where the agent failed to understand twice in a row, where the caller requested a human but didn't get one, or where sentiment at hang-up was negative. Feed high-quality transcripts back into the fine-tuning data pipeline for the STT model.

Human handoff

The escalation path is as important as the conversation path. Triggers for human escalation:

Two consecutive "I didn't understand that" responses followed by a third miss (three-strike rule: rephrase after the first miss, try an alternative approach after the second, escalate on the third).
Explicit caller request ("talk to a person", "let me speak to someone").
Sentiment score below threshold (measured by a lightweight parallel sentiment model, not the main LLM).
Compliance keyword detection (regulated industry-specific terms that require a human in the loop).
Tool call repeated failure (CRM lookup failed twice — don't loop, escalate).
Maximum call duration reached.

Transfer mechanism: SIP REFER for blind transfers, warm transfer API (Twilio's) for introducing the caller. Always pass a structured handoff packet — conversation summary, extracted entities, unresolved intent, emotional state, prior attempts — so the receiving human agent has context without asking the caller to repeat their problem. Nothing degrades customer experience faster than being transferred and asked to start over.
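One way to keep the handoff packet consistent is to give it a fixed shape. The sketch below mirrors the fields listed above; the class name, field names, and JSON serialization are assumptions, and how the packet reaches the human agent's desktop depends on your contact-center stack.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffPacket:
    summary: str            # short conversation summary
    entities: dict          # extracted entities, e.g. order or account IDs
    unresolved_intent: str  # what the caller still needs
    emotional_state: str    # e.g. output of the parallel sentiment model
    prior_attempts: list = field(default_factory=list)

    def to_json(self) -> str:
        # Stable key order makes packets easy to diff and log.
        return json.dumps(asdict(self), sort_keys=True)
```

Building the packet before initiating the transfer means the transfer itself never waits on a summarization call.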

Edge cases & gotchas

Echo feedback catastrophe. Skip AEC and the agent transcribes its own TTS output, generates a response to itself, plays that, transcribes that, and generates another response. The call loops until the caller hangs up in frustration. Always configure AEC before any other optimization.

Long-session latency drift. Teams frequently notice in production that their voice agent, which performs well in testing, gets progressively slower over long calls. OpenAI Realtime's median turn latency grows from ~800ms early in the session to over 2s after 20+ turns as the context window expands. The fix (session rotation) is well-known but requires explicit implementation — no provider applies it automatically.

G.711 at 8kHz fed to STT. PSTN calls arrive at 8kHz. STT models trained on 16kHz produce materially higher word error rates on 8kHz input. This is the single most common audio quality mistake in voice AI deployments that integrate with telephony for the first time.
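For orientation, here is a sketch of one conversion path, assuming the STT endpoint expects 16 kHz, 16-bit little-endian linear PCM and the trunk delivers G.711 mu-law. The function names are hypothetical, and the linear-interpolation upsampler is deliberately simplistic; a production pipeline would use a proper resampling filter or a telephony-tuned STT model.

```python
import struct


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84
    return -magnitude if u & 0x80 else magnitude


def upsample_2x(samples: list) -> list:
    """8 kHz to 16 kHz by inserting the midpoint between neighbours."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out += [s, (s + nxt) // 2]
    return out


def ulaw_8k_to_pcm16_16k(payload: bytes) -> bytes:
    """Convert a mu-law RTP payload to 16 kHz little-endian PCM16 bytes."""
    samples = upsample_2x([ulaw_to_pcm16(b) for b in payload])
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20ms mu-law frame is 160 bytes, so each frame becomes 640 bytes of 16 kHz PCM16.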

Audio buffer not dropped on barge-in. The jitter buffer typically holds 200–400ms of queued TTS audio for resilience. If you call stop() on the TTS stream but do not flush the jitter buffer, the user hears the tail of the previous response after interrupting. Call flush() / drop packets, not drain().
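The difference between the two operations is easiest to see side by side. This is a toy playout buffer, not any particular SDK's API; the method names follow the `flush()` and `drain()` wording above.

```python
from collections import deque


class PlayoutBuffer:
    """Toy queue of TTS frames waiting to be sent to the caller."""

    def __init__(self) -> None:
        self._frames = deque()

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def drain(self) -> list:
        """Play out everything queued: the caller hears the stale tail."""
        played = list(self._frames)
        self._frames.clear()
        return played

    def flush(self) -> int:
        """Drop queued frames unplayed; returns how many were discarded."""
        dropped = len(self._frames)
        self._frames.clear()
        return dropped
```

On barge-in the orchestrator would call `flush()`. With 20ms frames, a return value of 15 means roughly 300ms of stale audio was never played.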

Scaling on CPU metrics. Voice agent pods read 10–20% CPU while running 30+ concurrent WebSocket streams. The event loop is saturated; the CPU is idle. CPU-based autoscaling fires far too late — after you are already serving degraded latency to dozens of callers. Build and expose active_calls as a Prometheus gauge from day one.
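Once `active_calls` is exported, the scaling decision reduces to a ceiling division, the same shape the Kubernetes Horizontal Pod Autoscaler uses for an average-value custom metric. The sketch below is illustrative; the target of 20 calls per pod is an assumed value chosen to sit below the saturation point, not a measured one.

```python
import math


def desired_replicas(active_calls_per_pod, target_per_pod, min_replicas=1):
    """Replica count needed to bring average calls per pod down to the target."""
    total = sum(active_calls_per_pod)
    return max(min_replicas, math.ceil(total / target_per_pod))


# Three pods near saturation, with an assumed target of 20 calls per pod.
print(desired_replicas([28, 31, 30], target_per_pod=20))  # prints 5
```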

Cold connection overhead. Creating a new TLS connection to Deepgram or Cartesia per call adds 20–100ms before any audio is processed. Against a 40ms TTS TTFA, that overhead alone can double the effective first-audio latency, or worse. Pre-warm connection pools at pod startup and recycle connections across calls.

Synchronous long tool calls. Any tool call that blocks the conversation loop for more than 500ms creates audible silence. Users interpret silence as system failure after about 2 seconds. Never await a slow external API synchronously. Use the async filler-audio pattern for any tool that might take over 300ms.
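A minimal sketch of the filler-audio pattern with asyncio: start the tool call, wait up to the 300ms threshold, and speak a filler phrase only if the result has not arrived. `tool` and `play_filler` are hypothetical coroutine functions supplied by the orchestrator.

```python
import asyncio


async def run_tool_with_filler(tool, play_filler, filler_after_s=0.3):
    """Run a tool call; play filler audio if it exceeds the threshold."""
    task = asyncio.create_task(tool())
    # asyncio.wait does not cancel the task when the timeout expires.
    done, _ = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await play_filler()  # e.g. "One moment while I look that up."
    return await task
```

Fast tools return without any filler, so the caller only hears the phrase when there would otherwise be silence.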

Function calling accuracy. GPT-4o achieves approximately 50% accuracy on multi-turn voice function-calling benchmarks. This is the most common source of silent failures in production voice AI — the model calls the wrong function, omits required parameters, or calls the function with hallucinated arguments. Add a validation layer: schema-validate all function call arguments before dispatching, and have the orchestrator retry with explicit error feedback rather than silently failing.
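The validation layer can be sketched as a function that returns a list of errors, which the orchestrator feeds back to the model on retry. The registry format here (function name mapped to required parameter types) is a simplification chosen for illustration; a real system would more likely validate against full JSON Schema definitions.

```python
def validate_call(name, args, registry):
    """Return error strings for a proposed call; empty means safe to dispatch."""
    schema = registry.get(name)
    if schema is None:
        return [f"unknown function: {name}"]
    errors = []
    for param, expected in schema.items():
        if param not in args:
            errors.append(f"missing required parameter: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"{param} must be {expected.__name__}")
    # Hallucinated extra arguments are rejected rather than silently dropped.
    errors += [f"unexpected parameter: {p}" for p in args if p not in schema]
    return errors


# Hypothetical registry with a single tool.
REGISTRY = {"lookup_order": {"order_id": str}}
print(validate_call("lookup_order", {"order_id": 5}, REGISTRY))
# prints: ['order_id must be str']
```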

Infinite misunderstanding loop. An agent without a hard escalation limit will ask "I'm sorry, could you repeat that?" until the caller hangs up. Implement the three-strike rule at the orchestrator level: track consecutive no-understand events, and escalate to human unconditionally on the third. No LLM decision involved — it is a hard counter check.
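The hard counter can be as small as the sketch below. The class name and the action strings are hypothetical; the point is that the escalation decision is plain control flow in the orchestrator, with no model call involved.

```python
class StrikeCounter:
    """Counts consecutive no-understand events and picks the next action."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit
        self.misses = 0

    def record(self, understood: bool) -> str:
        if understood:
            self.misses = 0  # any understood turn resets the streak
            return "continue"
        self.misses += 1
        if self.misses >= self.limit:
            return "escalate"
        return "rephrase" if self.misses == 1 else "alternative"
```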

Trade-offs to discuss in an interview

Cascade (STT→LLM→TTS) vs. speech-to-speech. S2S achieves 200ms theoretical latency and preserves prosody; cascade achieves 520ms best-case but provides text logs at every boundary, HIPAA-eligible transcripts as natural artifacts, mature tool-calling, and provider flexibility. For regulated industries, cascade is the only viable option. For consumer products where emotional naturalness drives conversion, S2S is worth evaluating — but benchmark it in your specific use case.

Managed platform vs. self-hosted. Vapi ($0.05/min platform fee), Retell ($0.055/min), Bland AI ($0.11–$0.14/min) reduce time-to-market from months to weeks. Self-hosted (LiveKit + Pipecat + direct APIs) has lower marginal cost above ~10,000 minutes/month and gives full pipeline control. For startups: start managed. For scale: self-host after product-market fit.

Speculative LLM pre-fetch. Start the LLM call on a partial transcript before end-of-turn is confirmed. Saves ~500ms on the 40% of turns where the speculative response commits. Requires rollback logic for the 60% of discards and careful handling of tool calls started speculatively. Worth implementing after the baseline pipeline is stable and you have production traffic to measure the commit rate.
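The expected payoff follows directly from the commit rate. The sketch below works through the figures quoted above, under the assumption that every discarded speculation forces one additional LLM request on the final transcript.

```python
def speculative_prefetch(commit_rate: float, saving_ms: float):
    """Expected per-turn saving and LLM request multiplier for pre-fetch."""
    expected_saving_ms = commit_rate * saving_ms
    # One speculative request per turn, plus a re-issue for each discard.
    llm_calls_per_turn = 1 + (1 - commit_rate)
    return expected_saving_ms, llm_calls_per_turn


saving, calls = speculative_prefetch(commit_rate=0.40, saving_ms=500)
print(f"{saving:.0f} ms saved per turn on average, {calls:.1f}x LLM requests")
```

At a 40% commit rate the average saving is about 200ms per turn, paid for with roughly 1.6 LLM requests per turn, which is why measuring the commit rate on production traffic comes first.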

LLM model selection. Groq-hosted Llama (160–200ms TTFT) versus GPT-4o (500ms TTFT) is a 300–340ms difference, more than any other single optimization. But GPT-4o has higher function-calling accuracy and better instruction following. Consider adaptive routing: simple turns use the fast small model; detected complex tasks (multi-tool, ambiguous intent) route to the larger model at the cost of one slower turn.

Real-time transcripts vs. post-call only. Streaming transcripts to your CRM or ticketing system in real time lets human agents pull up the customer's record before the call ends, but it adds a live-write path with its own failure modes. Post-call processing is simpler and allows heavier models for summarization. The choice depends on whether your use case requires human-in-the-loop during the call.

Things you should now be able to answer

Why does streaming at every stage (STT, LLM, TTS) reduce total latency from the sum of stage completion times (~2,700ms) to the sum of stage first-token times (~520ms)?
Walk through the three turn detection strategies and their latency-accuracy trade-offs. Which would you choose as a default, and why?
What is the exact sequence of operations when a user speaks over a playing TTS response (barge-in)? What breaks if you drain the audio buffer instead of flushing it?
Why must acoustic echo cancellation be configured before any other optimization, and what catastrophically happens if you skip it?
Why does OpenAI Realtime API's turn latency grow from 800ms to 2s+ after 20 turns, and what are the three mitigation strategies?
A PSTN call arrives via Twilio SIP trunk. Trace the audio from the caller's microphone to the Deepgram STT model, naming every format conversion and why it's needed.
Your voice agent pods are at 15% CPU but experiencing latency spikes. Why, and what metric should you actually scale on?
GPT-4o achieves only 50% accuracy on multi-turn voice function-calling benchmarks. How do you design around this?
When should you choose a managed platform (Vapi, Retell) vs. self-hosted (LiveKit, Pipecat), and what is the rough break-even point?
Describe the human handoff flow, including what data you pass to the receiving agent and how you transfer the call at the SIP level.

Frequently asked questions

▸Why do most production voice AI deployments miss the 800ms latency target even with fast components?

Because the bottleneck is rarely any single component — it is the sum of sequential decisions that each seem reasonable in isolation. Default turn detection settings alone add 1.5–4 seconds (providers often default to a 1.5s silence wait before triggering). Add sequential processing — waiting for LLM completion before starting TTS — and you lose another 500ms to 1s. Then add geographic distance: each US cross-region hop adds 60–80ms, and cross-continental is 200–300ms. Hamming AI analyzed 4 million production calls and found P50 latency of 1.4–1.7s, P95 at 4.3–5.4s — far above the 800ms target even with capable components in place.

▸What is the difference between VAD-based, STT-based, and model-based turn detection?

VAD (Voice Activity Detection) classifies audio frames as speech or silence and triggers end-of-turn after a configurable silence gap — simple and fast but adds that entire silence window to every response. STT endpointing uses provider signals like Deepgram's `speech_final` event, which combines acoustic VAD with punctuation-based sentence completion — better accuracy, similar latency. Model-based detection (Deepgram Flux, LiveKit semantic VAD) uses a transformer classifier trained on conversational turn boundaries; Deepgram Flux fires 150–250ms earlier per turn than pipeline approaches at the cost of 50–70% more total calls processed. Use STT endpointing as the default; add model-based detection after the rest of the pipeline is stable.

▸When should you use a speech-to-speech model instead of the cascade pipeline?

Speech-to-speech (S2S) models like OpenAI Realtime API or Moshi process audio tokens end-to-end without explicit STT and TTS stages, achieving 200ms theoretical latency versus 520ms best-case for a tuned cascade. S2S wins on raw latency and preserves prosody and emotion. But cascade wins on debuggability (text logs at every boundary), compliance (audit transcripts are a natural byproduct), HIPAA eligibility (OpenAI Realtime audio is explicitly excluded from the BAA), mature tool-calling, and provider flexibility. For regulated industries, start with cascade. For consumer products where emotional naturalness matters most, evaluate S2S in production — do not decide from benchmarks alone.

▸How do you handle barge-in correctly, and what breaks if you get it wrong?

Barge-in means the user speaks while the agent is talking. You detect this via VAD or ASR running in parallel on the inbound audio track (full-duplex processing is required — the agent must process user audio even while playing TTS output). On detection: send `response.cancel` to the LLM, close or flush the TTS WebSocket within 60ms, and drop (not drain) the queued jitter-buffer audio. If you drain instead of drop, the user hears the tail of the agent's previous sentence after interrupting — they feel ignored. Two harder cases: distinguish true interruptions from filler sounds like "uh-huh" using a confidence threshold; handle mid-tool-call barge-in where there is no active TTS to cancel, requiring the orchestrator to decide whether to abandon or surface the pending tool result.

▸What are the most important metrics to track for a production voice AI system?

Track Time to First Audio (TTFA), measured from the end-of-speech signal to the first audio byte played, as your primary latency metric. Average call duration is not a substitute: it masks whether the agent is fast or slow. Also track: P50/P90/P95 TTFA broken down by STT, LLM, and TTS stage; barge-in false positive rate (how often the agent stops talking for a backchannel or background noise that was not a real interruption); tool call success rate (GPT-4o function calling is only 50% accurate on multi-turn voice benchmarks); and session context drift (median latency at turn 1 vs turn 20, which can diverge from 800ms to 2s+ on unbounded context). Layer in business metrics: first-call resolution rate, escalation rate, and sentiment at hang-up.


// RELATED