AI voice agents are no longer science fiction. Over the past year I shipped two production calling platforms — NextGen Calling and DialexAI — that handle inbound and outbound calls autonomously. This article walks through the core architecture: how audio flows from Twilio into a transcription pipeline, how the agent makes decisions, and the hard-won lessons from running this in production.
Why voice, and why now?
The convergence of three things made AI calling practical in 2024: Twilio's real-time media streaming API, sub-300ms speech-to-text models, and LLMs that can be constrained to a decision tree. None of these are new individually — but together they cross the threshold where latency is acceptable in a real conversation.
Architecture overview
The stack has four layers: Twilio handles the PSTN call and streams raw audio over WebSocket. FastAPI receives the media stream and runs it through a transcription service. A Redis-backed agent context window feeds each transcript chunk to the LLM decision loop. The loop produces a text response that gets converted to speech via a TTS provider and streamed back.
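```python
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()

# Simplified Twilio media stream handler.
# AgentService, RealtimeTranscriber and stream_tts_to_caller are application code not shown here.
@app.websocket("/media-stream/{call_sid}")
async def media_stream(ws: WebSocket, call_sid: str):
    await ws.accept()
    agent = await AgentService.get_by_call(call_sid)
    transcriber = RealtimeTranscriber()

    async for message in ws.iter_text():
        event = json.loads(message)
        # Twilio also sends "connected", "start" and "stop" events; only "media" carries audio
        if event["event"] != "media":
            continue

        audio_chunk = base64.b64decode(event["media"]["payload"])
        transcript = await transcriber.feed(audio_chunk)

        # The transcriber returns nothing until it has a complete utterance
        if transcript:
            response = await agent.process(transcript)
            await stream_tts_to_caller(ws, call_sid, response)
```

Setting up Twilio media streams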
Twilio's media stream sends μ-law encoded audio at 8kHz over a WebSocket connection. The first thing I had to handle was codec conversion, because most transcription APIs expect 16kHz linear PCM. I used the standard library's `audioop` module for the conversion, which adds negligible latency. Note that `audioop` was deprecated in Python 3.11 and removed in 3.13, so newer runtimes need a replacement for it.
```python
import audioop
import base64

def decode_twilio_audio(payload: str) -> bytes:
    # Twilio delivers each audio frame as base64-encoded μ-law
    mulaw_audio = base64.b64decode(payload)
    # Convert 8kHz μ-law → 16kHz linear PCM (16-bit samples, mono)
    pcm_8k = audioop.ulaw2lin(mulaw_audio, 2)
    # state=None restarts the resampler on every chunk; pass the returned
    # state back in to resample a continuous stream without seams
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
    return pcm_16k
```

One subtle issue: Twilio sends audio in small 20ms chunks. Most STT APIs perform better with larger chunks. I buffer 300ms of audio before sending to the transcription endpoint, which noticeably improved accuracy on accented speech.
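The 300ms buffering can be sketched roughly as follows. This is a minimal illustration, not the production code: the `ChunkBuffer` class and its method names are hypothetical, and it assumes the audio has already been converted to 16kHz, 16-bit mono PCM.

```python
class ChunkBuffer:
    """Accumulate small PCM frames and release them in fixed-duration chunks."""

    def __init__(self, chunk_ms: int = 300, sample_rate: int = 16000, sample_width: int = 2):
        # Bytes per chunk = samples per second * bytes per sample * seconds
        self.chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
        self._buf = bytearray()

    def feed(self, pcm: bytes) -> list[bytes]:
        """Add one frame and return every complete chunk that is now available."""
        self._buf.extend(pcm)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[:self.chunk_bytes]))
            del self._buf[:self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left over, for example when the call ends."""
        rest = bytes(self._buf)
        self._buf.clear()
        return rest
```

A 20ms frame at 16kHz, 16-bit mono is 640 bytes, so under these assumptions fifteen frames fill one 300ms chunk. Each chunk returned by `feed` would then go to the transcription endpoint.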
The agent decision loop
Each agent has a system prompt, a knowledge base, and a set of callable actions (transfer call, book appointment, send SMS). The LLM receives the full conversation history on each turn and returns either a spoken response or a structured action payload.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Method on the agent class
async def process(self, user_turn: str) -> str:
    self.history.append({"role": "user", "content": user_turn})

    # Awaiting requires the async client; the module-level openai.chat API is synchronous
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": self.system_prompt},
            *self.history,
        ],
        tools=self.available_tools,
        tool_choice="auto",
        max_tokens=150,  # keep spoken replies short
        temperature=0.4,
    )

    choice = response.choices[0]
    # The model asked for an action instead of a spoken reply
    if choice.finish_reason == "tool_calls":
        return await self._execute_tool(choice.message.tool_calls[0])

    reply = choice.message.content
    self.history.append({"role": "assistant", "content": reply})
    return reply
```

Temperature 0.4 is intentional: it reduces hallucination while keeping responses natural. Above 0.6 the agent starts making things up, which is catastrophic in a business context. Below 0.2 it sounds robotic.
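For context, here is a rough sketch of what the `available_tools` list and the `_execute_tool` step might look like. The tool names follow the actions mentioned earlier, but the parameters, handler functions, and reply strings are all hypothetical. The only parts taken from the OpenAI API are the `{"type": "function", ...}` schema shape and the fact that tool arguments arrive as a JSON string.

```python
import json

# Hypothetical tool definitions; parameters are illustrative, not the production schema.
AVAILABLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book an appointment for the caller.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO 8601 date"},
                    "time": {"type": "string", "description": "24-hour local time"},
                },
                "required": ["date", "time"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_call",
            "description": "Transfer the caller to a department.",
            "parameters": {
                "type": "object",
                "properties": {"department": {"type": "string"}},
                "required": ["department"],
            },
        },
    },
]

# Hypothetical handlers; real ones would call a booking system or the Twilio API.
def book_appointment(date: str, time: str) -> str:
    return f"Booked for {date} at {time}."

def transfer_call(department: str) -> str:
    return f"Transferring you to {department}."

HANDLERS = {"book_appointment": book_appointment, "transfer_call": transfer_call}

def execute_tool(name: str, arguments_json: str) -> str:
    """Route a tool call to its handler and return the sentence to speak."""
    handler = HANDLERS.get(name)
    if handler is None:
        return "Sorry, I can't do that."
    return handler(**json.loads(arguments_json))
```

In the `process` method, `name` and `arguments_json` would come from `tool_call.function.name` and `tool_call.function.arguments`.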
Redis for call state
Call state — the conversation history, agent config, and any collected data — lives in Redis with a TTL matching the maximum call duration. This makes horizontal scaling trivial: any FastAPI instance can pick up any call because state is external.
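```python
# Store agent context in Redis (`redis` is assumed to be a redis.asyncio client)
# setex takes (name, time, value); it has no `ex` keyword
await redis.setex(
    f"call:{call_sid}:context",
    3600,  # 1 hour TTL, in seconds
    agent_context.model_dump_json(),
)
```

Production gotchas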
A few things burned me in production that I didn't anticipate in development:
- Silence detection matters. Users pause mid-sentence. Without a voice activity detector, the agent interrupts constantly. I added a 600ms silence threshold before triggering the LLM.
- LLM latency spikes kill calls. On a bad day GPT-4o-mini takes 1.5s. Users interpret anything above 2s total as a dead line. Add a verbal filler ("Let me check that for you...") to buy time while the LLM is thinking.
- Twilio drops WebSocket connections after 4 hours. For long-running inbound lines (customer support), you need reconnection logic with state recovery from Redis.
- Tool calls must be idempotent. The LLM occasionally calls the same tool twice in quick succession. Deduplication by call_sid + tool + timestamp prevents double-bookings.
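
The deduplication in the last point can be sketched like this. It is an in-memory illustration under stated assumptions: the `ToolCallDeduplicator` class, the 5-second window, and the injectable clock are hypothetical, and the timestamp is interpreted as "the same tool on the same call within a short window".

```python
import time
from typing import Callable

class ToolCallDeduplicator:
    """Suppress repeats of the same tool on the same call within a short window."""

    def __init__(self, window_s: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._last_run: dict[tuple[str, str], float] = {}

    def should_execute(self, call_sid: str, tool: str) -> bool:
        now = self.clock()
        key = (call_sid, tool)
        last = self._last_run.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate inside the window: skip it
        self._last_run[key] = now
        return True
```

With several FastAPI instances the check would need shared storage instead of a local dict, for example a Redis `SET` with the `NX` and `EX` options on a key built from the same fields.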
What's next
The next frontier is interruption handling — allowing callers to cut off the agent mid-sentence, just like a human conversation. Twilio's bidirectional streaming makes this possible but the state management is non-trivial. I'll cover that in a follow-up post.