System Architecture
Phonx AI
Production voice AI for US insurance — sub-second turn-taking, compliance-aware reasoning, CRM-native execution.
Project Overview
Phonx AI is a voice agent system built for US insurance agencies running high-volume outbound enrollment and inbound policyholder support. It places and receives calls through Twilio, runs real-time speech understanding and reasoning under a one-second turn budget, and writes outcomes back into the agency's CRM as structured events — not transcripts.
The system is designed around a hard constraint that defines voice AI in this market: a human caller will hang up if the agent feels slow, robotic, or evasive on regulated topics.
The one-second turn budget, measured from end-of-utterance to the first TTS audio frame, is the single number every architectural decision is tested against. Inference provider, retrieval strategy, and how state is held across barge-ins are all chosen to protect that margin without losing accuracy on insurance-specific reasoning.
Today Phonx handles outbound enrollment booking flows in production, and inbound qualification is moving toward pilot.
The Problem
US insurance agencies operate under three pressures that legacy IVR and offshore call centres can't resolve simultaneously:
Open Enrollment Periods compress months of demand into weeks. Headcount can't flex that fast, and missed calls are missed policies.
Every statement an agent makes about coverage, eligibility, or benefits is a potential audit finding. Scripts drift, agents improvise, and quality monitoring is sampled — not exhaustive.
A call that doesn't land in the CRM as a structured outcome — booked appointment, qualified lead, opted-out contact — effectively didn't happen. Most voice tools produce transcripts; agencies need state changes.
Phonx is built to absorb the volume, hold the compliance line on every call, and write directly into the workflows agencies already run.
What We Built
A voice agent system structured as four cooperating layers, each with its own latency budget and failure mode:
Audio moves between the caller and the system over Twilio's Media Streams, with Krisp running noise suppression and acoustic turn-taking on the inbound leg before any model sees the audio.
The real-time speech and language layer runs STT, LLM reasoning, and TTS as a streaming pipeline. Groq handles both Whisper-based STT and LLM inference, where token-level latency is the constraint; ElevenLabs handles voice synthesis, where naturalness under interruption matters.
Conversation state and domain knowledge live across Redis, for hot session state and fast reads during a turn, and Neo4j, for the insurance knowledge graph. The graph encodes plan structures, eligibility rules, and the relationships between products, carriers, and regulatory boundaries so the agent's answers are grounded in domain logic, not just retrieved text.
Structured outcomes write back into GoHighLevel today, with persistent records in Postgres-on-EC2. CRM sync runs out-of-band so the conversation never blocks on it. The four layers are deliberately decoupled — telephony failures don't corrupt state, model latency spikes don't break call control.
System Architecture
Twilio absorbs carrier-side complexity — DID provisioning, STIR/SHAKEN, regional routing. Krisp runs noise suppression and acoustic turn-taking before STT, because residual noise on US PSTN calls degrades transcription accuracy more than it degrades human comprehension.
PSTN audio arrives at 8 kHz and is upsampled to 16 kHz before transcription, since Whisper expects 16 kHz input. Transcription is streaming, not batch: partial transcripts feed the conversation manager so the system can start reasoning before the utterance is complete.
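As a rough illustration of the upsampling step, the sketch below decodes one telephony audio frame and doubles its sample rate. It assumes the frame arrives as base64-encoded 8 kHz G.711 mu-law, which is how Twilio Media Streams delivers audio. The function names are hypothetical, and linear interpolation is a deliberately crude stand-in for whatever resampler the production pipeline uses.

```python
import base64
from typing import List


def mulaw_to_pcm16(payload: bytes) -> List[int]:
    """Decode G.711 mu-law bytes into signed 16-bit PCM sample values."""
    samples = []
    for byte in payload:
        u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        samples.append(-magnitude if sign else magnitude)
    return samples


def upsample_2x(samples: List[int]) -> List[int]:
    """Double the sample rate (8 kHz -> 16 kHz) by linear interpolation."""
    out = []
    for i, current in enumerate(samples):
        # The last sample has no successor, so it is simply repeated.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + nxt) // 2)
    return out


def frame_to_16k(base64_payload: str) -> List[int]:
    """Hypothetical helper: one base64 media frame in, 16 kHz PCM samples out."""
    return upsample_2x(mulaw_to_pcm16(base64.b64decode(base64_payload)))
```

A production resampler would normally apply a low-pass filter rather than interpolate linearly; the point here is only the shape of the step that sits between the telephony leg and STT.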
The conversation manager is the orchestration layer. It holds turn state, manages barge-in when a caller starts speaking mid-response, consumes the acoustic turn-taking signal from Krisp, and decides when to commit a partial response to TTS versus wait. Fallbacks and timeouts are first-class.
Groq is chosen for token throughput at low latency. For a voice agent, the binding constraint is not just tokens-per-dollar but tokens-per-second under a one-second budget. Reasoning prompts stay narrow and grounded; insurance-specific knowledge comes from retrieval, not prompt bloat.
Synthesised speech from ElevenLabs is streamed back over the Twilio media channel. Voice choice and prosody matter for trust on insurance calls: the agent is not trying to pass as human, but it cannot sound like an old IVR either.
The knowledge graph is the proprietary asset. Plans, carriers, eligibility rules, and regulatory constraints are modelled in Neo4j as a graph with vector embeddings on relevant nodes. A question like "does this plan cover insulin pumps under Part D" resolves through graph traversal combined with semantic match, not one or the other.
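To make the combination of graph traversal and semantic match concrete, here is a minimal in-memory sketch. The node ids, the toy three-dimensional embeddings, and the function names are all hypothetical; the real graph lives in Neo4j and its schema is not described in this document. Traversal first narrows the candidates to nodes reachable from the plan, and cosine similarity then ranks only those candidates.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical toy graph: node id -> ids of directly related nodes.
EDGES: Dict[str, Set[str]] = {
    "plan:alpha": {"benefit:durable-equipment", "benefit:prescription-drugs"},
    "benefit:durable-equipment": set(),
    "benefit:prescription-drugs": set(),
    "benefit:dental": set(),  # exists in the graph but is not linked to plan:alpha
}

# Hypothetical embeddings; real ones would come from an embedding model.
EMBEDDINGS: Dict[str, Sequence[float]] = {
    "benefit:durable-equipment": (1.0, 0.0, 0.0),
    "benefit:prescription-drugs": (0.0, 1.0, 0.0),
    "benefit:dental": (0.0, 0.0, 1.0),
}


def reachable(start: str, max_hops: int) -> Set[str]:
    """Breadth-first traversal: nodes within max_hops of start, excluding start."""
    frontier, seen = {start}, {start}
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in EDGES.get(node, set())} - seen
        seen |= frontier
    return seen - {start}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_lookup(plan: str, query_vec: Sequence[float],
                  max_hops: int = 2, top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank by semantic similarity, but only among nodes the plan can reach."""
    candidates = reachable(plan, max_hops)
    scored = [(n, cosine(query_vec, EMBEDDINGS[n])) for n in candidates if n in EMBEDDINGS]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

In this toy setup, a query vector that is semantically closest to the dental node still cannot return it for `plan:alpha`, because no path connects the two. That is the behaviour neither pure vector search nor pure traversal gives on its own.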
Redis holds live session state: turn history, extracted entities, and compliance flags. Postgres-on-EC2 holds persistent call records. Once the call closes, a separate worker writes structured outcomes into GoHighLevel. CRM writes are async, retryable, and never block the caller experience.
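A structured outcome, as opposed to a transcript, might look something like the following. The field names and the `OutcomeEvent` type are assumptions for illustration; only the three outcome kinds (booked appointment, qualified lead, opted-out contact) come from the text, and the actual GoHighLevel payload shape is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class Outcome(str, Enum):
    BOOKED_APPOINTMENT = "booked_appointment"
    QUALIFIED_LEAD = "qualified_lead"
    OPTED_OUT = "opted_out"


@dataclass(frozen=True)
class OutcomeEvent:
    """Hypothetical event the async worker would send to the CRM connector."""
    call_id: str
    contact_id: str
    outcome: Outcome
    occurred_at: str                       # ISO 8601 timestamp
    fields: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        payload = asdict(self)
        payload["outcome"] = self.outcome.value  # serialise the enum as its string value
        return json.dumps(payload, sort_keys=True)
```

Because the event is a closed set of outcome kinds plus typed fields, the CRM side can map it to a state change directly, without parsing free text.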
The hard problems behind the one-second budget.
Holding a one-second turn budget end-to-end.
Total latency = audio capture + STT + end-of-utterance detection + LLM + TTS + audio playback. Each component has a budget; none can take the whole pie. Groq inference and streaming STT partials create most of the headroom. The remaining margin is spent on retrieval against the knowledge graph, which has its own timeout and degraded fallback using a smaller cached subgraph.
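The budget arithmetic and the retrieval fallback can be sketched as below. The per-stage millisecond figures are purely illustrative placeholders, not measured Phonx numbers, and `live_lookup` and `cached_subgraph` are hypothetical names. The retrieval timeout is derived as whatever margin the other stages leave, and a timeout degrades to the cached subgraph rather than stalling the turn.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

TURN_BUDGET_MS = 1000

# Illustrative allocations only; real figures would come from production traces.
STAGE_BUDGET_MS: Dict[str, int] = {
    "audio_capture": 50,
    "stt": 150,
    "end_of_utterance_detection": 150,
    "llm": 300,
    "tts": 150,
    "audio_playback": 50,
}


def retrieval_timeout_ms() -> int:
    """Retrieval gets the margin left after every other stage is paid for."""
    return TURN_BUDGET_MS - sum(STAGE_BUDGET_MS.values())


async def retrieve_with_fallback(
    query: str,
    live_lookup: Callable[[str], Awaitable[List[str]]],
    cached_subgraph: Dict[str, List[str]],
    timeout_ms: int,
) -> List[str]:
    """Try the live graph; on timeout, answer from the smaller cached subgraph."""
    try:
        return await asyncio.wait_for(live_lookup(query), timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        return cached_subgraph.get(query, [])
```

With the placeholder figures above, `retrieval_timeout_ms()` evaluates to 150, so a live lookup slower than that is abandoned and the cached answer is used instead.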
Barge-in without context loss.
When a caller interrupts mid-response, the system has to stop TTS playback quickly, decide whether the new utterance replaces or refines the prior turn, and preserve the partial response so the agent does not repeat itself if the interruption was just a backchannel like "uh-huh" or "right". This is handled in the conversation manager, not outsourced to the LLM.
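One way to picture the decision the conversation manager makes after an interruption is the sketch below. It assumes playback has already been paused, and it classifies the caller's words with a fixed phrase list, which is a simplification: the phrase set, the `ResponseInFlight` type, and the returned dictionary shape are all hypothetical, and a real system would likely also weigh acoustic cues.

```python
from dataclasses import dataclass
from typing import List

# Assumed backchannel phrases, for illustration only.
BACKCHANNELS = {"uh-huh", "uh huh", "right", "ok", "okay", "yeah", "mm-hmm", "i see"}


@dataclass
class ResponseInFlight:
    sentences: List[str]   # the planned response, split into TTS chunks
    spoken: int = 0        # how many chunks were fully played before the interruption

    def remainder(self) -> List[str]:
        return self.sentences[self.spoken:]


def is_backchannel(transcript: str) -> bool:
    return transcript.lower().strip(" .,!?") in BACKCHANNELS


def handle_barge_in(response: ResponseInFlight, transcript: str) -> dict:
    """Decide what happens after caller speech paused playback."""
    if is_backchannel(transcript):
        # Not a real interruption: continue from where playback stopped.
        return {"action": "resume", "say": response.remainder()}
    # A genuine new turn: record what was already said so it is not repeated.
    return {
        "action": "new_turn",
        "already_said": response.sentences[:response.spoken],
        "caller_said": transcript,
    }
```

Keeping this logic in ordinary code rather than in the LLM prompt means the resume-or-replace decision costs no model round trip, which matters under a one-second budget.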
Compliance as a runtime constraint, not a post-hoc filter.
Insurance regulations restrict what an agent can say about coverage, premiums, and eligibility. These constraints are encoded as edges in the knowledge graph, connecting benefit claims to the regulatory boundaries that govern them. The agent does not generate a claim and then filter it; retrieval is constrained before generation.
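The idea of constraining retrieval before generation can be sketched with a small in-memory stand-in for the graph. Everything named here is hypothetical: the claim ids, the rule ids, and the `GOVERNED_BY` mapping are invented for illustration and do not describe real regulations or the actual Neo4j schema. The point is only the ordering: claims whose governing rules are not satisfied never reach the prompt.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str


# Hypothetical edges: claim id -> ids of the regulatory boundaries that govern it.
GOVERNED_BY: Dict[str, Set[str]] = {
    "claim:appointment-available": set(),
    "claim:premium-estimate": {"rule:licensed-agent-only"},
    "claim:eligibility-summary": {"rule:disclaimer-read"},
}


def permitted_claims(candidates: List[Claim], satisfied_rules: Set[str]) -> List[Claim]:
    """Keep only claims whose governing rules are all satisfied on this call."""
    allowed = []
    for claim in candidates:
        rules = GOVERNED_BY.get(claim.claim_id)
        # Claims with no entry in the graph are dropped (fail closed).
        if rules is not None and rules <= satisfied_rules:
            allowed.append(claim)
    return allowed


def build_context(candidates: List[Claim], satisfied_rules: Set[str]) -> str:
    """Only permitted claims are ever placed in the generation context."""
    return "\n".join(c.text for c in permitted_claims(candidates, satisfied_rules))
```

Failing closed on unknown claims is a design choice in this sketch: a claim the graph knows nothing about is treated as not permitted rather than as unrestricted.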
State recovery on dropped calls.
PSTN calls drop. When a call disconnects mid-flow, Redis holds session state long enough that a callback can resume from the last completed turn rather than starting over. This matters most in enrollment flows, where re-collecting fields kills conversion.
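A minimal sketch of checkpointing and resuming session state follows. The store interface mirrors the `set(name, value, ex=...)` and `get(name)` calls of redis-py, so a real Redis client could be passed in place of the in-memory stand-in used here to keep the example runnable. The key format, the 15-minute retention window, and keying by phone number are assumptions, not details from this document.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

SESSION_TTL_SECONDS = 15 * 60  # assumed retention window, not a Phonx figure


class InMemoryStore:
    """Stand-in exposing the subset of redis-py's set/get used below."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, name: str, value: str, ex: int) -> None:
        self._data[name] = (value, self._clock() + ex)

    def get(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None or entry[1] <= self._clock():
            return None  # missing or expired
        return entry[0]


def checkpoint_turn(store, phone: str, turn_index: int, fields: Dict[str, str]) -> None:
    """Persist state after each completed turn, refreshing the expiry."""
    state = {"last_completed_turn": turn_index, "fields": fields}
    store.set(f"session:{phone}", json.dumps(state), ex=SESSION_TTL_SECONDS)


def resume_point(store, phone: str) -> Optional[dict]:
    """On callback, return the saved state, or None if it has expired."""
    raw = store.get(f"session:{phone}")
    return json.loads(raw) if raw is not None else None
```

Writing the checkpoint only after a turn completes means a callback never resumes from a half-collected field, at the cost of repeating at most one question.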
CRM synchronisation under failure.
GoHighLevel's API can hit rate limits and latency spikes. The CRM connector queues outcome events in Redis and replays them with exponential backoff, with a dead-letter path for repeated failures. The call experience never depends on CRM availability in real time.
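The replay behaviour can be illustrated with the sketch below. It uses an in-memory deque where the text describes a Redis-backed queue, and the retry cap, base delay, and function names are assumptions. `send` stands for any callable that returns True on a successful CRM write; `sleep` is injected so the logic can be exercised without real waiting.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_ATTEMPTS = 5           # assumed retry cap
BASE_DELAY_SECONDS = 1.0   # assumed delay before the first retry


def backoff_delay(attempt: int) -> float:
    """Delay after failed attempt number `attempt` (1-based): 1s, 2s, 4s, ..."""
    return BASE_DELAY_SECONDS * (2 ** (attempt - 1))


def drain(
    queue: Deque[Dict],
    send: Callable[[Dict], bool],
    sleep: Callable[[float], None],
    dead_letter: List[Dict],
) -> None:
    """Replay queued outcome events; park repeated failures in dead_letter."""
    while queue:
        event = queue.popleft()
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if send(event):
                break
            if attempt < MAX_ATTEMPTS:
                sleep(backoff_delay(attempt))
        else:
            # Every attempt failed: keep the event for inspection, do not drop it.
            dead_letter.append(event)
```

This simplified loop retries one event at a time and omits jitter; a production worker would more likely reschedule failed events so one stuck write does not hold up the rest of the queue.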
Technologies Used
Outcomes
Measured from end-of-utterance to first TTS audio frame, p50 across production outbound calls.
Currently handling open enrollment booking flows for partner agencies in the US insurance market.
Qualification and routing flows in active development, scheduled for production rollout following pilot validation.
From Interest to Delivery
Map current intents, volume distribution, and compliance boundaries against the existing agency stack.
Deploy on a single high-volume flow, typically enrollment booking, measured against the agency's existing baseline.
Full integration with the agency's CRM and telephony, with compliance logging and outcome tracking.