TwoPlus
Engineering · Apr 18, 2026

A year with voice agents

Shipping sub-second voice taught us that 90% of voice AI problems aren’t voice problems. A tour of the three traps, what actually matters, and the code path we rewrote seven times.

Priya Shah · Head of Voice · 11 min read

We shipped our first voice agent in January 2025. It was unusable. Not because it said the wrong things — it said reasonable things. It was unusable because the 1.4-second gap between “hello?” and “hi, this is Helix…” was long enough for every human on the line to hang up. Voice AI's fundamental problem is that humans can hear the latency. They can't see it in chat; they hear it in voice, and they judge you for it.

Chat lets you think. Voice doesn't. Every millisecond of silence is a vote against you.

Latency is the whole game

Our pipeline at the start: ASR → LLM → TTS → audio. Each stage was individually optimized, the stages ran strictly in sequence, and the total came to 1,400ms. Over a year we got it to 340ms, and the difference in customer satisfaction between those two numbers is larger than any other change we've ever shipped. Here's what actually helped:

  • Streaming ASR, streaming LLM, streaming TTS. Obvious in hindsight. The LLM starts generating before the user is done speaking. The TTS starts speaking before the LLM is done generating. Every stage overlaps.
  • Smaller models for simple turns. Not every reply needs the frontier model. “Okay, one moment” is a 20ms decision.
  • Predictive warmup. When the user starts a sentence, we guess what tools might be needed and warm them up. If we guess wrong, we throw the work away.
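The streaming point is worth making concrete. Below is a sketch using hypothetical stage interfaces (none of this is our real SDK): each stage is an async generator that emits output as soon as it has input, so time-to-first-audio becomes the sum of each stage's *first-chunk* latency rather than the sum of each stage's total latency.

```typescript
type Chunk = string;

// Each stage transforms a stream of chunks, emitting as soon as input arrives.
async function* asrStream(audio: AsyncIterable<Chunk>): AsyncGenerator<Chunk> {
  for await (const frame of audio) yield `text(${frame})`; // partial transcripts
}

async function* llmStream(text: AsyncIterable<Chunk>): AsyncGenerator<Chunk> {
  for await (const t of text) yield `tok(${t})`; // tokens start before the transcript is final
}

async function* ttsStream(tokens: AsyncIterable<Chunk>): AsyncGenerator<Chunk> {
  for await (const tok of tokens) yield `audio(${tok})`; // audio starts before generation ends
}

// First audible output only waits for one chunk from each stage, not for any
// stage to finish.
async function firstAudioChunk(frames: Chunk[]): Promise<Chunk> {
  async function* mic() { for (const f of frames) yield f; }
  const pipeline = ttsStream(llmStream(asrStream(mic())));
  const { value } = await pipeline.next();
  return value as Chunk;
}
```

The same composition works with real ASR/LLM/TTS clients as long as each exposes a chunk stream; the overlap is what took our sequential 1,400ms down toward the per-stage first-chunk floor.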

The interrupt problem

The second hardest problem is letting the user interrupt. In a real conversation, the second person starts talking before the first has finished roughly 30% of the time. Early voice agents can't do this — they finish their sentence, then listen. It sounds robotic because it is.

We built what we call graceful yield: if the user starts speaking mid-response, we immediately fade the TTS over 150ms and start listening. The model receives: “you were saying X, the user interrupted with Y, respond to Y while acknowledging you were cut off.” It's not perfect, but it's night and day compared to the old “please let me finish” experience.

```typescript
// Simplified interrupt handler
onUserVAD((userSpeaking) => {
  if (userSpeaking && tts.isPlaying) {
    tts.fadeOut(150);                                // fade over 150ms instead of cutting hard
    transcript.mark('INTERRUPTED_AT', ttsProgress);  // remember where we were cut off
    llm.prepareForFollowup({ wasInterrupted: true });
  }
});
```
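The follow-up context itself can be assembled like this — a minimal sketch, with field names that are illustrative rather than our production schema:

```typescript
// What the model sees after an interruption: what the agent was saying when
// the fade-out fired, plus the utterance that interrupted it.
interface InterruptContext {
  agentWasSaying: string; // transcript text spoken up to the fade-out point
  userSaid: string;       // the interrupting utterance
}

function buildFollowupPrompt(ctx: InterruptContext): string {
  return [
    `You were saying: "${ctx.agentWasSaying}"`,
    `The user interrupted with: "${ctx.userSaid}"`,
    `Respond to the user while briefly acknowledging you were cut off.`,
  ].join("\n");
}
```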

Turn-taking is hard

The question “is the user done speaking, or just pausing to think?” is the single most subtle problem in voice. Too eager and you talk over them. Too patient and the conversation feels dead. We went through five approaches:

  • Fixed timeout (500ms). Too eager for slow speakers, too patient for fast ones.
  • VAD silence detection. Better. Still wrong on people who say “umm” a lot.
  • Pause classifier. ML model predicting “end of turn” vs “thinking pause.” Much better. Still not human.
  • Acoustic + semantic. Combine pause classifier with “does this sentence sound complete?” Feels close to human.
  • Per-caller calibration. Learn from each caller's speaking rhythm within the first 20 seconds. This one shipped.
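A hedged sketch of how those signals might fold into one endpoint decision — the thresholds, the 0.5/0.9 probability cutoffs, and the 1.5× calibration factor are all illustrative, not our tuned values:

```typescript
interface TurnSignals {
  silenceMs: number;         // how long the caller has been silent
  pauseEndProb: number;      // pause classifier: P(end of turn), 0..1
  semanticComplete: boolean; // does the partial transcript read as a finished sentence?
  callerAvgPauseMs: number;  // learned from this caller's first ~20 seconds
}

function isEndOfTurn(s: TurnSignals): boolean {
  // Calibrate the silence threshold to this caller's own rhythm, with a floor.
  const threshold = Math.max(250, s.callerAvgPauseMs * 1.5);
  if (s.silenceMs < threshold) return false;            // still inside a normal pause
  if (!s.semanticComplete) return s.pauseEndProb > 0.9; // "umm..." needs strong acoustic evidence
  return s.pauseEndProb > 0.5;                          // complete sentence + plausible pause
}
```

The structure is the point: acoustic evidence alone speaks only when it is overwhelming, and a semantically complete sentence lowers the bar.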

Voice needs different safety

Safety in text is about content. Safety in voice is about content and persona. People trust voices in a way they don't trust text — the UI signals are missing. This matters for:

  • Identity disclosure. Every call announces “this is an AI assistant” in the first 3 seconds. Non-negotiable.
  • Escalation. Sensitive topics (medical, legal, financial commitments, emotional distress) route to humans. Classifier runs in parallel with the response.
  • Recording notice. We record everything for coaching. We say so.
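Running the classifier in parallel rather than as a pre-check means escalation adds no latency to the common case. A sketch with stubbed generate/classify functions — the category list and `Route` shape are illustrative, not our production types:

```typescript
type Route = { kind: "speak"; text: string } | { kind: "human"; reason: string };

const SENSITIVE = ["medical", "legal", "financial", "distress"];

async function handleTurn(
  userText: string,
  generate: (t: string) => Promise<string>,        // the response pipeline
  classify: (t: string) => Promise<string | null>, // topic category, or null
): Promise<Route> {
  // Both start immediately; neither waits on the other.
  const [reply, topic] = await Promise.all([generate(userText), classify(userText)]);
  // If the classifier flags a sensitive topic, the generated reply is discarded
  // and the call routes to a person.
  if (topic && SENSITIVE.includes(topic)) return { kind: "human", reason: topic };
  return { kind: "speak", text: reply };
}
```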
The line we won't cross

We won't let a voice agent impersonate a specific human, even the caller's own agent. Voice cloning is technically easy and socially toxic.

What's next

The next frontier is emotional presence: not faking emotion, but responding to the caller's. When someone's frustrated, the agent should slow down, drop formality, and offer a human. When someone's happy, it can match the energy. We're shipping early versions of this now; the hard part is doing it without it feeling manipulative.
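In code, the frustrated-caller behavior described above can be as simple as a state-to-delivery mapping — a deliberately crude sketch, where the states and numbers are illustrative and real delivery control is far more continuous than a three-way switch:

```typescript
interface Delivery {
  speakingRate: number;          // 1.0 = neutral TTS pace
  formality: "low" | "high";
  offerHuman: boolean;           // proactively offer a handoff
}

function adjustDelivery(state: "frustrated" | "happy" | "neutral"): Delivery {
  switch (state) {
    case "frustrated":
      // Slow down, drop formality, offer a human.
      return { speakingRate: 0.85, formality: "low", offerHuman: true };
    case "happy":
      // Match the energy.
      return { speakingRate: 1.05, formality: "low", offerHuman: false };
    default:
      return { speakingRate: 1.0, formality: "high", offerHuman: false };
  }
}
```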

Voice AI has gone from “this is embarrassing” to “this is useful” in 18 months. The gap between useful and indistinguishable feels more like physics than software — every decibel of improvement gets harder. But every decibel also matters more, because humans really can hear the difference.
