November 15, 2024 · 8 min read

Building an AI Voice Agent with FastAPI and Twilio

A deep dive into the architecture behind NextGen Calling — how I wired FastAPI, Twilio media streams, real-time transcription, and a decision loop into a production AI calling agent.

AI voice agents are no longer science fiction. Over the past year I shipped two production calling platforms — NextGen Calling and DialexAI — that handle inbound and outbound calls autonomously. This article walks through the core architecture: how audio flows from Twilio into a transcription pipeline, how the agent makes decisions, and the hard-won lessons from running this in production.

Why voice, and why now?

The convergence of three things made AI calling practical in 2024: Twilio's real-time media streaming API, sub-300ms speech-to-text models, and LLMs that can be constrained to a decision tree. None of these are new individually — but together they cross the threshold where latency is acceptable in a real conversation.

Architecture overview

The stack has four layers. Twilio handles the PSTN call and streams raw audio over a WebSocket. FastAPI receives the media stream and runs it through a transcription service. Each transcript chunk, together with the agent's Redis-backed conversation context, is fed to the LLM decision loop. The loop produces a text response that a TTS provider converts to speech, which is then streamed back to the caller.

python
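# Simplified Twilio media stream handler.
# AgentService, RealtimeTranscriber, and stream_tts_to_caller are
# application-specific helpers that are not shown here.
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()
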
@app.websocket("/media-stream/{call_sid}")
async def media_stream(ws: WebSocket, call_sid: str):
    await ws.accept()
    agent = await AgentService.get_by_call(call_sid)
    transcriber = RealtimeTranscriber()

    async for message in ws.iter_text():
        event = json.loads(message)
        if event["event"] != "media":
            continue

        audio_chunk = base64.b64decode(event["media"]["payload"])
        transcript = await transcriber.feed(audio_chunk)

        if transcript:
            response = await agent.process(transcript)
            await stream_tts_to_caller(ws, call_sid, response)
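
The handler calls `stream_tts_to_caller`, which is not shown. As a rough sketch of the outbound half, audio goes back to Twilio as JSON `media` messages carrying base64 μ-law, tagged with the `streamSid` that Twilio reports in the stream's `start` event (not the call SID). The helper name `build_media_messages` and the choice to split audio into 160-byte (20ms) frames are assumptions for illustration; it also assumes the TTS output has already been converted to 8kHz μ-law.

```python
import base64
import json
from typing import Iterator

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio


def build_media_messages(stream_sid: str, mulaw_audio: bytes,
                         frame_bytes: int = FRAME_BYTES) -> Iterator[str]:
    """Yield Twilio "media" messages (JSON strings) for raw mu-law audio."""
    for start in range(0, len(mulaw_audio), frame_bytes):
        frame = mulaw_audio[start:start + frame_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
```

Inside the WebSocket handler, each message would then be sent with `await ws.send_text(message)`.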

Setting up Twilio media streams

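Twilio's media stream sends μ-law encoded audio at 8kHz over a WebSocket connection. The first thing I had to handle was codec conversion, since most transcription APIs expect 16kHz linear PCM. I used the standard library's `audioop` module for the conversion, which adds negligible latency. Note that `audioop` was deprecated in Python 3.11 and removed in 3.13, so on newer interpreters this step needs a replacement.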

python
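import audioop
import base64

def decode_twilio_audio(payload: str) -> bytes:
    mulaw_audio = base64.b64decode(payload)
    # Convert 8kHz μ-law → 16kHz linear PCM (16-bit samples, mono)
    pcm_8k = audioop.ulaw2lin(mulaw_audio, 2)
    # Passing None as state restarts the resampler on every chunk; keep the
    # returned state and pass it back in if chunk-boundary artifacts matter.
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
    return pcm_16k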
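
On Python versions where `audioop` is unavailable, the μ-law decoding step can be done by hand. The following is a minimal sketch of standard G.711 μ-law expansion using a precomputed lookup table. The function names are hypothetical, and it covers only the decode to 16-bit PCM, not the 8kHz to 16kHz resampling.

```python
import struct


def _ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 possible values once so decoding is a table lookup.
_ULAW_TABLE = [_ulaw_byte_to_pcm16(b) for b in range(256)]


def ulaw_to_pcm16(mulaw_audio: bytes) -> bytes:
    """Return little-endian 16-bit PCM for a mu-law byte string."""
    samples = [_ULAW_TABLE[b] for b in mulaw_audio]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so a 160-byte Twilio frame decodes to 320 bytes of PCM.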

One subtle issue: Twilio sends audio in small 20ms chunks. Most STT APIs perform better with larger chunks. I buffer 300ms of audio before sending to the transcription endpoint — this noticeably improved accuracy on accented speech.
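
The buffering step can be sketched as a small accumulator. At 16kHz with 16-bit samples, 300ms is 9,600 bytes, or fifteen 20ms frames of 640 bytes each. The class name `ChunkBuffer` and its methods are assumptions for illustration, not the code running in production.

```python
from typing import Optional


class ChunkBuffer:
    """Accumulate small audio frames and release fixed-duration chunks."""

    def __init__(self, sample_rate: int = 16000, sample_width: int = 2,
                 chunk_ms: int = 300) -> None:
        # Bytes per chunk = samples/sec * bytes/sample * seconds
        self._chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> Optional[bytes]:
        """Add a frame; return a full chunk once enough audio is buffered."""
        self._buffer.extend(frame)
        if len(self._buffer) < self._chunk_bytes:
            return None
        chunk = bytes(self._buffer[:self._chunk_bytes])
        del self._buffer[:self._chunk_bytes]
        return chunk

    def flush(self) -> bytes:
        """Return whatever is left, e.g. when the call ends."""
        remainder = bytes(self._buffer)
        self._buffer.clear()
        return remainder
```

The trade-off is latency: a larger `chunk_ms` gives the transcriber more context per request but delays the first transcript by the same amount.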

The agent decision loop

Each agent has a system prompt, a knowledge base, and a set of callable actions (transfer call, book appointment, send SMS). The LLM receives the full conversation history on each turn and returns either a spoken response or a structured action payload.

python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # async client; reads OPENAI_API_KEY from the environment

# Method on the agent class
async def process(self, user_turn: str) -> str:
    self.history.append({"role": "user", "content": user_turn})

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": self.system_prompt},
            *self.history,
        ],
        tools=self.available_tools,
        tool_choice="auto",
        max_tokens=150,  # keep spoken replies short
        temperature=0.4,
    )

    choice = response.choices[0]
    if choice.finish_reason == "tool_calls":
        # Structured action: run the first requested tool
        return await self._execute_tool(choice.message.tool_calls[0])

    # content can be None, so fall back to an empty string
    reply = choice.message.content or ""
    self.history.append({"role": "assistant", "content": reply})
    return reply

Temperature 0.4 is intentional: in my testing it cut down on hallucination while keeping responses natural. Above 0.6 the agent started making things up, which is catastrophic in a business context. Below 0.2 it sounded robotic.

Redis for call state

Call state — the conversation history, agent config, and any collected data — lives in Redis with a TTL matching the maximum call duration. This makes horizontal scaling trivial: any FastAPI instance can pick up any call because state is external.

python
# Store agent context in Redis
await redis.setex(
    f"call:{call_sid}:context",
    3600,  # 1 hour TTL, in seconds
    agent_context.model_dump_json(),
)

Production gotchas

A few things burned me in production that I didn't anticipate in development:

  • Silence detection matters. Users pause mid-sentence. Without a voice activity detector, the agent interrupts constantly. I added a 600ms silence threshold before triggering the LLM.
  • LLM latency spikes kill calls. On a bad day GPT-4o-mini takes 1.5s. Users interpret anything above 2s total as a dead line. Add a verbal filler ("Let me check that for you...") to buy time while the LLM is thinking.
  • Twilio drops WebSocket connections after 4 hours. For long-running inbound lines (customer support), you need reconnection logic with state recovery from Redis.
  • Tool calls must be idempotent. The LLM occasionally calls the same tool twice in quick succession. Deduplicating on call_sid + tool + timestamp (rounded to a short window, so that back-to-back calls collide) prevents double-bookings.

What's next

The next frontier is interruption handling — allowing callers to cut off the agent mid-sentence, just like a human conversation. Twilio's bidirectional streaming makes this possible but the state management is non-trivial. I'll cover that in a follow-up post.

Written by
Nasir Hussain
Sr. Full-Stack Developer