2026-01-16 · Engineering

OrcaVoice: Sub-100ms Voice Transcription

Built a low-latency voice interface that transcribes speech, chats with Kaitlin, and speaks responses back.

The Latency Problem

Started with WhisperLiveKit streaming. It seemed like the right approach, but latency was measured in seconds. Turns out the bottleneck wasn't the transcription itself; it was model loading overhead.

The discovery: with the model pre-loaded, a whisper.cpp server transcribes the audio in ~90ms. The same audio through the mlx_whisper CLI takes 1.2 seconds.

Architecture

Mic → PortAudio → Resample (16kHz) → VAD → Buffer → whisper.cpp HTTP → Orca → Chatterbox TTS → Speaker
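The transcription hop in that pipeline is just an HTTP POST of the buffered audio to the local whisper.cpp server. A minimal sketch, assuming whisper.cpp's bundled server example (its /inference endpoint and "file" multipart field) running on port 8080; the port, endpoint, and field names are assumptions, not confirmed by the post:

```elixir
defmodule OrcaVoice.Whisper do
  # Build a multipart/form-data body carrying WAV bytes under the "file"
  # field, which is what whisper.cpp's server example expects.
  def multipart_body(wav_binary, boundary) do
    "--#{boundary}\r\n" <>
      "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n" <>
      "Content-Type: audio/wav\r\n\r\n" <>
      wav_binary <> "\r\n--#{boundary}--\r\n"
  end

  # Requires :inets to be started (Application.ensure_all_started(:inets)).
  def transcribe(wav_binary) do
    boundary = "orcavoice-#{System.unique_integer([:positive])}"
    body = multipart_body(wav_binary, boundary)
    content_type = ~c"multipart/form-data; boundary=#{boundary}"

    {:ok, {{_, 200, _}, _headers, resp_body}} =
      :httpc.request(
        :post,
        {~c"http://127.0.0.1:8080/inference", [], content_type, body},
        [],
        []
      )

    # The server replies with JSON like %{"text" => "..."}; decode with
    # whatever JSON library the app already uses.
    to_string(resp_body)
  end
end
```

Keeping the server (and therefore the model) resident between requests is what turns the 1.2-second CLI round trip into ~90ms.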

Energy-Based VAD

A simple RMS-based voice activity detector with tuned thresholds decides when speech starts and when it has ended.

The Stack

- Elixir for the glue: audio loop, VAD, buffering
- PortAudio for microphone capture
- whisper.cpp (server mode, model pre-loaded) for transcription
- Orca for the chat hop with Kaitlin
- Chatterbox for TTS

Code Highlights

VAD processing is dead simple: just an RMS calculation over each frame of 16-bit PCM:

# RMS energy of a frame of signed 16-bit little-endian PCM.
defp calculate_rms(pcm_data) do
  # Decode the raw binary into a list of 16-bit samples.
  samples = for <<sample::signed-little-16 <- pcm_data>>, do: sample
  sum_squares = Enum.reduce(samples, 0, fn s, acc -> acc + s * s end)
  # Root of the mean square, truncated to an integer for threshold checks.
  :math.sqrt(sum_squares / length(samples)) |> trunc()
end
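A quick sanity check of that helper on synthetic PCM (the module wrapper and public function here exist only to make the snippet self-contained; in the app it's a private helper):

```elixir
defmodule RMSDemo do
  # Same calculation as above, made public for the demo.
  def calculate_rms(pcm_data) do
    samples = for <<sample::signed-little-16 <- pcm_data>>, do: sample
    sum_squares = Enum.reduce(samples, 0, fn s, acc -> acc + s * s end)
    :math.sqrt(sum_squares / length(samples)) |> trunc()
  end
end

# One 10ms frame at 16kHz is 160 samples.
silence = :binary.copy(<<0::signed-little-16>>, 160)
tone = :binary.copy(<<1000::signed-little-16>>, 160)

RMSDemo.calculate_rms(silence) # => 0
RMSDemo.calculate_rms(tone)    # => 1000
```

One caveat worth guarding in production: an empty frame makes `length(samples)` zero and the division raises, so the caller should skip zero-length buffers.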

Non-speech annotations like (clap) and [BLANK_AUDIO] are filtered out of the transcript before it is sent to the LLM.
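That filter can be sketched as a regex strip followed by an emptiness check, so turns that were pure noise never reach the LLM. The exact token list used in the app isn't given in the post; the alternatives below are common examples of what whisper models emit:

```elixir
defmodule OrcaVoice.Filter do
  # Parenthesized sound effects and bracketed markers like [BLANK_AUDIO].
  @noise ~r/\((?:clap|laughs|music)\)|\[.*?\]/

  def clean(text) do
    text
    |> String.replace(@noise, "")
    |> String.trim()
  end

  # Skip the LLM round trip entirely if nothing real remains.
  def worth_sending?(text), do: clean(text) != ""
end
```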

Results

End-to-end, from end of speech to transcript ready: ~100-150ms. Keeping the model pre-loaded in whisper.cpp makes all the difference.

Next: barge-in detection to interrupt Kaitlin mid-sentence when I start talking.