Realtime voice that never talks over you.

02 · Voice AI

From the phone in a hand to the model and back — supervised, the whole way on one line.

OpenAI Realtime · Gemini Live · ElevenLabs · Anthropic · Deepgram

Bring your own model

Swap providers without rewiring. The same call runs on every major realtime voice model — and you can change your mind in production.

Barge-in that actually works

The moment a caller starts speaking, the agent stops. Instantly, and the same way on every provider. No awkward overlaps, no waiting for a sentence to finish.

Perception, on the line

It tells the caller from the TV, a coworker, or background noise — and decides on its own whether to mute, listen through, or clean up the audio. The caller just gets understood.

Agents that do things

Tools, memory and actions run alongside the audio, so the agent can act mid-call instead of only talking. One pipe carries speech up, audio down, and everything in between.

caller speaks to silence at the ear

voice providers, one integration

Voice AI · FAQ

Questions builders ask about realtime voice that never talks over you.

OpenAI Realtime, Gemini Live, ElevenLabs Conversational AI, and a self-hosted llama.cpp worker out of the box. For pipelined stacks (separate STT/LLM/TTS), wire any combination — Deepgram, AssemblyAI, Cartesia, Eleven Multilingual — the substrate handles the joins.

From the moment the caller's first voiced frame arrives to the moment agent audio stops: ~40-80 ms p50 on the same continent. Most of that is the speech-gate hangover (we wait one frame to confirm it's not a sneeze). The cancel itself is one packet over QUIC.
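A minimal sketch of the hangover idea described above, assuming a per-frame VAD probability as input. The class name `SpeechGate`, the `confirm_frames` and `threshold` parameters, and the `on_barge_in` callback are hypothetical names for illustration, not the product's API. The gate fires only after the first voiced frame is confirmed by the next one, so a single-frame burst does not cancel the agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SpeechGate:
    """Fires on_barge_in once voicing has lasted confirm_frames + 1 frames in a row."""
    on_barge_in: Callable[[], None]
    confirm_frames: int = 1   # extra frames to wait after the first voiced frame (assumed)
    threshold: float = 0.5    # VAD probability cutoff (assumed)
    _run: int = field(default=0, init=False)
    _fired: bool = field(default=False, init=False)

    def push(self, vad_prob: float) -> None:
        if vad_prob >= self.threshold:
            self._run += 1
            # Fire once per utterance, after the confirmation frame(s).
            if not self._fired and self._run > self.confirm_frames:
                self._fired = True
                self.on_barge_in()
        else:
            # Silence resets the gate so the next utterance can fire again.
            self._run = 0
            self._fired = False


events: List[str] = []
gate = SpeechGate(on_barge_in=lambda: events.append("cancel"))

# One isolated voiced frame (the "sneeze"), then a sustained utterance.
for prob in [0.1, 0.9, 0.2, 0.8, 0.9, 0.95]:
    gate.push(prob)
```

In this sketch the isolated 0.9 frame is ignored, and the cancel is issued on the second consecutive voiced frame of the real utterance. In a real deployment the callback would send the cancel to the provider rather than append to a list.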

Speaker-agnostic on first hearing: we enroll the dominant voice from the first ~3 seconds of voiced speech, then use it to gate. The approach is d-vector-based, in the lineage of WeSpeaker / 3D-Speaker. There is no upfront enrollment flow. For known-caller scenarios (your own employees), you can pre-enroll a voiceprint and ship it as a tenant secret.
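To make the enroll-then-gate idea concrete, here is a small sketch that averages the first few speaker embeddings into a voiceprint and then gates later windows by cosine similarity. Everything named here is an assumption for illustration: the `SpeakerGate` class, the one-second embedding window, the 0.7 threshold, and the toy 2-D vectors standing in for real d-vectors.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerGate:
    def __init__(self, enroll_seconds: float = 3.0, window_seconds: float = 1.0,
                 threshold: float = 0.7,
                 voiceprint: Optional[Sequence[float]] = None):
        self.needed = math.ceil(enroll_seconds / window_seconds)
        self.threshold = threshold
        self.samples: List[List[float]] = []
        # A pre-enrolled voiceprint skips the on-call enrollment phase.
        self.voiceprint = list(voiceprint) if voiceprint is not None else None

    def push(self, embedding: Sequence[float], voiced: bool = True) -> bool:
        """Return True if this window should be treated as the caller."""
        if not voiced:
            return False
        if self.voiceprint is None:
            # Still enrolling: collect embeddings and let the audio through.
            self.samples.append(list(embedding))
            if len(self.samples) >= self.needed:
                dims = len(self.samples[0])
                self.voiceprint = [
                    sum(s[i] for s in self.samples) / len(self.samples)
                    for i in range(dims)
                ]
            return True
        return cosine(embedding, self.voiceprint) >= self.threshold


gate = SpeakerGate()
for emb in ([1.0, 0.1], [0.9, 0.0], [1.0, -0.1]):   # ~3 s of caller speech
    gate.push(emb)

caller_ok = gate.push([0.95, 0.05])   # same direction as the voiceprint
tv_ok = gate.push([0.1, 1.0])         # a different voice
```

Passing `voiceprint=` to the constructor models the pre-enrolled case, where gating applies from the first window.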

VAD is a small Silero V5-style recurrent net that runs at frame rate and is integrated into the speech-gate so it can feed barge-in directly. Environment classification (caller / TV / music / coworker / car) sits on top of YAMNet-family acoustic embeddings. That's what tells the agent "this is noise, not interruption", so it ignores a YouTube video next to the mic.
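One way to picture how the two signals combine is a tiny decision function: barge-in is considered only when the VAD says "voiced" and the environment classifier's top label is the caller. This is an illustrative policy under assumed names and thresholds (`decide`, `vad_threshold`, `min_caller_score`), not a description of the shipped logic.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    BARGE_IN = "barge_in"
    IGNORE = "ignore"


def decide(vad_prob: float, class_scores: Dict[str, float],
           vad_threshold: float = 0.5, min_caller_score: float = 0.6) -> Action:
    """Combine a VAD probability with environment-class scores (hypothetical policy)."""
    if vad_prob < vad_threshold or not class_scores:
        return Action.IGNORE
    top = max(class_scores, key=class_scores.get)
    if top == "caller" and class_scores[top] >= min_caller_score:
        return Action.BARGE_IN
    # Voiced, but attributed to TV, music, a coworker, or the car.
    return Action.IGNORE


caller_case = decide(0.9, {"caller": 0.8, "tv": 0.1, "music": 0.1})
tv_case = decide(0.9, {"caller": 0.2, "tv": 0.7, "music": 0.1})
```

The second call shows the "YouTube next to the mic" case: the audio is voiced, but the top class is not the caller, so the agent keeps talking.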

QUIC stream independence keeps audio rendering. The agent's next reply takes a beat longer to arrive but the conversation doesn't lock up. We expose per-call jitter/inter-arrival in the post-call summary so you can see the tail shape for your network.
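As a rough illustration of reading the tail shape from inter-arrival data, the sketch below turns a list of packet arrival timestamps into inter-arrival gaps and nearest-rank percentiles. The function names and the sample timestamps are made up for the example; the actual fields in the post-call summary may differ.

```python
import math
from typing import Dict, List, Sequence


def inter_arrival_ms(arrivals_ms: Sequence[float]) -> List[float]:
    """Gaps between consecutive packet arrivals, in milliseconds."""
    return [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(arrivals_ms: Sequence[float]) -> Dict[str, float]:
    gaps = inter_arrival_ms(arrivals_ms)
    return {
        "p50_ms": percentile(gaps, 50),
        "p95_ms": percentile(gaps, 95),
        "max_ms": max(gaps),
    }


# Hypothetical arrivals for 20 ms audio frames, with one late packet.
summary = summarize([0, 20, 40, 61, 80, 140, 160])
```

A p50 close to the frame interval with a much larger p95 or max indicates occasional stalls rather than a uniformly slow link.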

Yes. On-prem deployments ship with the inference worker co-located — audio never leaves your VPC. The substrate handles transport + perception; the model is whatever binary you point it at.