Building an AI Voice Agent with FastAPI and Twilio

AI voice agents are no longer science fiction. Over the past year I shipped two production calling platforms — NextGen Calling and DialexAI — that handle inbound and outbound calls autonomously. This article walks through the core architecture: how audio flows from Twilio into a transcription pipeline, how the agent makes decisions, and the hard-won lessons from running this in production.

Why voice, and why now?

The convergence of three things made AI calling practical in 2024: Twilio's real-time media streaming API, sub-300ms speech-to-text models, and LLMs that can be constrained to a decision tree. None of these are new individually — but together they cross the threshold where latency is acceptable in a real conversation.

Architecture overview

The stack has four layers: Twilio handles the PSTN call and streams raw audio over WebSocket. FastAPI receives the media stream and runs it through a transcription service. A Redis-backed agent context window feeds each transcript chunk to the LLM decision loop. The loop produces a text response that gets converted to speech via a TTS provider and streamed back.

python

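# Simplified Twilio media stream handler
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()

# AgentService, RealtimeTranscriber, and stream_tts_to_caller are
# application-specific helpers defined elsewhere in the project.

@app.websocket("/media-stream/{call_sid}")
async def media_stream(ws: WebSocket, call_sid: str):
    await ws.accept()
    agent = await AgentService.get_by_call(call_sid)
    transcriber = RealtimeTranscriber()

    async for message in ws.iter_text():
        event = json.loads(message)
        if event["event"] == "stop":
            break  # Twilio signals the end of the stream
        if event["event"] != "media":
            continue  # skip "connected", "start", "mark", etc.

        audio_chunk = base64.b64decode(event["media"]["payload"])
        # feed() returns a transcript once an utterance is complete, else a falsy value
        transcript = await transcriber.feed(audio_chunk)

        if transcript:
            response = await agent.process(transcript)
            # Audio sent back to Twilio must carry the streamSid from the "start" event
            await stream_tts_to_caller(ws, call_sid, response)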
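
Twilio only opens this WebSocket when the call's TwiML asks for it. As a rough sketch, a bidirectional stream is requested with a `<Connect><Stream>` verb whose `url` points at the handler above. The helper name `build_stream_twiml` and the host `voice.example.com` are placeholders of mine, not part of the original code, and the XML is built with the standard library rather than Twilio's helper library to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(host: str, call_sid: str) -> str:
    """Return TwiML that connects the call to a bidirectional media stream.

    `host` and the /media-stream/{call_sid} path are assumptions that mirror
    the WebSocket route shown above.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=f"wss://{host}/media-stream/{call_sid}")
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(response, encoding="unicode")
```

In a voice webhook this string would be returned with content type `text/xml`, using the `CallSid` form field that Twilio posts with the request.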

Setting up Twilio media streams

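Twilio's media stream sends μ-law encoded audio at 8kHz over a WebSocket connection. The first thing I had to handle was codec conversion, since most transcription APIs expect 16kHz linear PCM. I used the standard library's `audioop` module for the conversion, which adds negligible latency. Note that `audioop` was deprecated in Python 3.11 and removed in 3.13, so on newer interpreters you will need a backport or another resampling library.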

python

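import audioop
import base64

def decode_twilio_audio(payload: str) -> bytes:
    """Decode a base64 Twilio media payload into 16kHz 16-bit mono PCM."""
    mulaw_audio = base64.b64decode(payload)
    # Convert 8kHz μ-law → 16-bit linear PCM (width=2 bytes per sample)
    pcm_8k = audioop.ulaw2lin(mulaw_audio, 2)
    # Upsample 8kHz → 16kHz; passing state=None treats each chunk independently
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
    return pcm_16k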

One subtle issue: Twilio sends audio in small 20ms chunks. Most STT APIs perform better with larger chunks. I buffer 300ms of audio before sending to the transcription endpoint — this noticeably improved accuracy on accented speech.
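
The buffering step can be sketched roughly as follows. The class name `AudioBuffer` and its interface are my own illustration, not the production code. It assumes chunks have already been converted to 16kHz 16-bit mono PCM, so 300ms is 16000 × 2 × 0.3 = 9600 bytes, or fifteen 20ms Twilio frames of 640 bytes each.

```python
class AudioBuffer:
    """Accumulate small PCM chunks and release them in fixed-size windows."""

    def __init__(self, sample_rate: int = 16000, sample_width: int = 2,
                 window_ms: int = 300) -> None:
        # Bytes per window: samples per second * bytes per sample * seconds
        self.window_bytes = sample_rate * sample_width * window_ms // 1000
        self._pending = bytearray()

    def add(self, chunk: bytes) -> list[bytes]:
        """Append a chunk and return every complete window now available."""
        self._pending.extend(chunk)
        windows = []
        while len(self._pending) >= self.window_bytes:
            windows.append(bytes(self._pending[:self.window_bytes]))
            del self._pending[:self.window_bytes]
        return windows

    def flush(self) -> bytes:
        """Return whatever is left (for example at end of call) and reset."""
        remainder = bytes(self._pending)
        self._pending.clear()
        return remainder
```

Each window returned by `add` would then go to the transcription endpoint, and `flush` covers the tail when the stream stops.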

The agent decision loop

Each agent has a system prompt, a knowledge base, and a set of callable actions (transfer call, book appointment, send SMS). The LLM receives the full conversation history on each turn and returns either a spoken response or a structured action payload.

python

from openai import AsyncOpenAI

# The async client is required for `await`; it reads OPENAI_API_KEY from the environment
client = AsyncOpenAI()

async def process(self, user_turn: str) -> str:
    self.history.append({"role": "user", "content": user_turn})

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": self.system_prompt},
            *self.history,
        ],
        tools=self.available_tools,
        tool_choice="auto",
        max_tokens=150,  # keep spoken replies short
        temperature=0.4,
    )

    choice = response.choices[0]
    if choice.finish_reason == "tool_calls":
        # Hand the first requested action to _execute_tool, which returns the text to speak
        return await self._execute_tool(choice.message.tool_calls[0])

    reply = choice.message.content
    self.history.append({"role": "assistant", "content": reply})
    return reply

Temperature 0.4 is intentional: in my testing it kept hallucination down while keeping responses natural. Above 0.6 the agent started making things up, which is catastrophic in a business context. Below 0.2 it sounded robotic.
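
For illustration, here is one way the callable actions could be declared and dispatched. The tool name `book_appointment`, its parameters, and the `execute_tool` helper are hypothetical; only the outer `{"type": "function", "function": {...}}` shape comes from the chat completions tools format. The dispatcher takes the tool name and the JSON argument string, which are what a tool call carries in `function.name` and `function.arguments`.

```python
import asyncio
import json

# Hypothetical tool declaration in the chat completions "tools" format.
AVAILABLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book an appointment for the caller.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO date, e.g. 2024-07-01"},
                    "time": {"type": "string", "description": "24h time, e.g. 14:30"},
                },
                "required": ["date", "time"],
            },
        },
    }
]


async def book_appointment(date: str, time: str) -> str:
    # Placeholder: a real handler would call the booking backend here.
    return f"You're booked for {date} at {time}."


# Map tool names to the coroutines that implement them.
TOOL_HANDLERS = {"book_appointment": book_appointment}


async def execute_tool(name: str, arguments_json: str) -> str:
    """Run the named tool and return the sentence the agent should speak."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return "Sorry, I can't do that right now."
    try:
        arguments = json.loads(arguments_json)
        return await handler(**arguments)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or unexpected argument names: fail soft, keep the call alive.
        return "Sorry, I didn't catch the details. Could you repeat that?"


if __name__ == "__main__":
    args = '{"date": "2024-07-01", "time": "14:30"}'
    print(asyncio.run(execute_tool("book_appointment", args)))
```

Returning a spoken sentence from every branch means the caller always hears something, even when the model requests a tool that does not exist.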

Redis for call state

Call state — the conversation history, agent config, and any collected data — lives in Redis with a TTL matching the maximum call duration. This makes horizontal scaling trivial: any FastAPI instance can pick up any call because state is external.

python

# Store agent context in Redis
await redis.set(
    f"call:{call_sid}:context",
    agent_context.model_dump_json(),
    ex=3600,  # 1 hour TTL; match this to your maximum call duration
)
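
To make the round trip concrete, here is a dependency-free sketch of what gets serialized. The original uses a Pydantic model (`model_dump_json()`); the `CallContext` dataclass, its fields, and the helper names below are stand-ins I've assumed for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CallContext:
    """Hypothetical per-call state; field names are illustrative."""
    call_sid: str
    agent_id: str
    history: list = field(default_factory=list)    # chat messages so far
    collected: dict = field(default_factory=dict)  # data gathered from the caller


def context_key(call_sid: str) -> str:
    # Same key layout as the Redis call above.
    return f"call:{call_sid}:context"


def dump_context(ctx: CallContext) -> str:
    """Serialize the context to the JSON string stored in Redis."""
    return json.dumps(asdict(ctx))


def load_context(raw: str) -> CallContext:
    """Rebuild the context from the JSON string read back from Redis."""
    return CallContext(**json.loads(raw))
```

Any worker could then rebuild the agent with something like `load_context(await redis.get(context_key(call_sid)))`, which is what lets the handlers stay stateless.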

Production gotchas

A few things burned me in production that I didn't anticipate in development:

Silence detection matters. Users pause mid-sentence. Without a voice activity detector, the agent interrupts constantly. I added a 600ms silence threshold before triggering the LLM.
LLM latency spikes kill calls. On a bad day GPT-4o-mini takes 1.5s. Users interpret anything above 2s total as a dead line. Add a verbal filler ("Let me check that for you...") to buy time while the LLM is thinking.
Twilio drops WebSocket connections after 4 hours. For long-running inbound lines (customer support), you need reconnection logic with state recovery from Redis.
Tool calls must be idempotent. The LLM occasionally calls the same tool twice in quick succession. Deduplicating on call_sid + tool + timestamp (rounded to a short window, since two calls never share an exact timestamp) prevents double-bookings.
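
As one example from the list above, the verbal filler can be sketched as a race between the LLM call and a timer. The function name `reply_with_filler`, the `speak` callback, and the 0.8s threshold are assumptions for illustration; tune the threshold against your own latency budget.

```python
import asyncio
from typing import Awaitable, Callable


async def reply_with_filler(
    llm_call: Awaitable[str],
    speak: Callable[[str], Awaitable[None]],
    filler: str = "Let me check that for you...",
    filler_after: float = 0.8,
) -> str:
    """Speak a filler phrase if the LLM has not answered within filler_after seconds."""
    task = asyncio.ensure_future(llm_call)
    try:
        # shield() keeps the LLM task running if the timeout fires
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after)
    except asyncio.TimeoutError:
        await speak(filler)
        reply = await task
    await speak(reply)
    return reply
```

On a fast turn the caller hears only the reply; on a slow one the filler plays first and the real answer follows as soon as the model returns.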

What's next

The next frontier is interruption handling — allowing callers to cut off the agent mid-sentence, just like a human conversation. Twilio's bidirectional streaming makes this possible but the state management is non-trivial. I'll cover that in a follow-up post.