2026-01-16 · Engineering
OrcaVoice: Sub-100ms Voice Transcription
Built a low-latency voice interface that transcribes speech, chats with Kaitlin, and speaks responses back.
The Latency Problem
Started with WhisperLiveKit streaming, which seemed like the right approach, but latency was measured in seconds. It turned out the bottleneck wasn't the transcription itself; it was model loading overhead.
The discovery: with the model pre-loaded in a whisper.cpp server, the same audio transcribed in ~90ms. Through the mlx_whisper CLI, which pays the model-load cost on every invocation, it took 1.2 seconds.
Architecture
Mic → PortAudio → Resample (16kHz) → VAD → Buffer → whisper.cpp HTTP → Orca → Chatterbox TTS → Speaker
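The "Resample (16kHz)" stage is worth a sketch, since whisper.cpp expects 16 kHz mono input. This is a minimal linear-interpolation resampler in Python for illustration only; the function name and frame handling are assumptions, not OrcaVoice's actual code, and a real pipeline would likely use a proper polyphase resampler.

```python
def resample_mono(samples, src_rate, dst_rate):
    """Linearly interpolate a list of mono PCM samples from src_rate to dst_rate.

    Illustrative sketch: production code should use a proper
    anti-aliased resampler (e.g. libsamplerate or soxr).
    """
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio            # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))  # linear blend
    return out
```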
Energy-Based VAD
Simple RMS-based voice activity detection with tuned thresholds:
- Speech start: RMS > 15000 (with 14.4x input gain)
- Speech end: RMS < 8000 for 375ms
- Debounce: Require 2 consecutive frames above threshold to start
- Minimum speech: 200ms to transcribe (filters clicks/bumps)
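Those four rules form a small state machine. Here's a sketch in Python: the thresholds (15000 start, 8000 end, 375ms end-silence, 2-frame debounce, 200ms minimum) come from the list above, while the frame size and all names are assumptions for illustration.

```python
class EnergyVAD:
    """Minimal sketch of the RMS-threshold VAD state machine described above."""

    def __init__(self, frame_ms=20):          # frame size is an assumption
        self.frame_ms = frame_ms
        self.start_threshold = 15000          # RMS to begin speech (post-gain)
        self.end_threshold = 8000             # below this, silence accumulates
        self.end_silence_ms = 375             # silence needed to end an utterance
        self.debounce_frames = 2              # consecutive loud frames to start
        self.min_speech_ms = 200              # shorter segments are discarded
        self.in_speech = False
        self.loud_run = 0
        self.silence_ms = 0
        self.speech_ms = 0

    def feed(self, rms):
        """Feed one frame's RMS; return 'start', 'end', 'drop', or None."""
        if not self.in_speech:
            # Debounce: require consecutive frames above the start threshold
            self.loud_run = self.loud_run + 1 if rms > self.start_threshold else 0
            if self.loud_run >= self.debounce_frames:
                self.in_speech = True
                self.speech_ms = self.loud_run * self.frame_ms
                self.silence_ms = 0
                self.loud_run = 0
                return "start"
            return None
        self.speech_ms += self.frame_ms
        if rms < self.end_threshold:
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.end_silence_ms:
                self.in_speech = False
                # Filter clicks/bumps: demand enough actual speech time
                long_enough = self.speech_ms - self.silence_ms >= self.min_speech_ms
                return "end" if long_enough else "drop"
        else:
            self.silence_ms = 0               # loud frame resets the silence run
        return None
```

Feeding it ten loud frames followed by silence yields a "start" then an "end"; a two-frame blip followed by silence is dropped.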
The Stack
- WhisperHTTP - POSTs PCM→WAV to whisper.cpp server
- OrcaClient - Sends transcription to /api/voice/chat, gets Kaitlin's response
- TTSClient - Chatterbox at 10.1.2.30:8005 for speech synthesis
- MQTT - Publishes events to orcavoice/# for Home Assistant integration
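The WhisperHTTP step is mostly just wrapping raw PCM in a WAV header and uploading it. A sketch, assuming the multipart /inference endpoint that whisper.cpp's bundled server exposes; the host/port, function names, and response handling are placeholders, not OrcaVoice's code.

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=16000):
    """Wrap raw 16-bit mono little-endian PCM in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)          # mono
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()

def transcribe(pcm_bytes, url="http://localhost:8080/inference"):
    """Hypothetical client call: needs the `requests` package and a running
    whisper.cpp server, so it is illustrative only."""
    import requests
    wav = pcm_to_wav(pcm_bytes)
    resp = requests.post(url, files={"file": ("audio.wav", wav, "audio/wav")})
    return resp.json().get("text", "")
```

Because the server keeps the model resident, each request pays only for inference, which is where the ~90ms figure comes from.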
Code Highlights
VAD processing is dead simple - just RMS calculation:
defp calculate_rms(pcm_data) do
  # Decode the raw binary as 16-bit signed little-endian PCM samples
  samples = for <<sample::signed-little-16 <- pcm_data>>, do: sample
  # Root mean square: sqrt of the mean of the squared sample values
  sum_squares = Enum.reduce(samples, 0, fn s, acc -> acc + s * s end)
  :math.sqrt(sum_squares / length(samples)) |> trunc()
end
Parenthetical sound annotations like (clap) and [BLANK_AUDIO] are filtered out before the text is sent to the LLM.
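That filter can be a one-liner. A sketch in Python, not OrcaVoice's actual (Elixir) code, assuming all parenthesized and bracketed tokens are non-speech annotations:

```python
import re

# Matches "(...)" or "[...]" annotations that Whisper emits for non-speech audio
NOISE_TOKEN = re.compile(r"\([^)]*\)|\[[^\]]*\]")

def clean_transcript(text):
    """Strip sound annotations and collapse leftover whitespace."""
    cleaned = NOISE_TOKEN.sub("", text)
    return " ".join(cleaned.split())
```

A transcript that is nothing but annotations cleans down to an empty string, which is the signal to skip the LLM round-trip entirely.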
Results
End-to-end from speech end to transcription ready: ~100-150ms. The model being pre-loaded in whisper.cpp makes all the difference.
Next: barge-in detection to interrupt Kaitlin mid-sentence when I start talking.