2026-01-16 · Engineering
OrcaVoice: Sub-100ms Voice Transcription
Built a low-latency voice interface that transcribes speech, chats with Kaitlin, and speaks responses back.
The Latency Problem
Started with WhisperLiveKit streaming, which seemed like the right approach, but latency was measured in seconds. It turned out the bottleneck wasn't the transcription itself; it was model loading overhead.
The discovery: with the model pre-loaded in a whisper.cpp server, the same audio transcribed in ~90ms. Through the mlx_whisper CLI, which pays the model-load cost on every invocation, it took 1.2 seconds.
Architecture
Mic → PortAudio → Resample (16kHz) → VAD → Buffer → whisper.cpp HTTP → Orca → Chatterbox TTS → Speaker
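The "Resample (16kHz)" stage is worth a sketch, since whisper.cpp expects 16 kHz mono input. This is a minimal linear-interpolation resampler in Python for illustration only; the function name and frame handling are assumptions, not OrcaVoice's actual code, and a real pipeline would likely use a proper polyphase resampler.

```python
def resample_mono(samples, src_rate, dst_rate):
    """Linearly interpolate a list of mono PCM samples from src_rate to dst_rate.

    Illustrative sketch: production code should use a proper
    anti-aliased resampler (e.g. libsamplerate or soxr).
    """
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio            # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(a + (b - a) * frac)))  # linear blend
    return out
```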
Energy-Based VAD
Simple RMS-based voice activity detection with tuned thresholds:
- Speech start: RMS > 15000 (with 14.4x input gain)
- Speech end: RMS < 8000 for 375ms
- Debounce: Require 2 consecutive frames above threshold to start
- Minimum speech: 200ms to transcribe (filters clicks/bumps)
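Those four rules form a small state machine. Here's a sketch in Python: the thresholds (15000 start, 8000 end, 375ms end-silence, 2-frame debounce, 200ms minimum) come from the list above, while the frame size and all names are assumptions for illustration.

```python
class EnergyVAD:
    """Minimal sketch of the RMS-threshold VAD state machine described above."""

    def __init__(self, frame_ms=20):          # frame size is an assumption
        self.frame_ms = frame_ms
        self.start_threshold = 15000          # RMS to begin speech (post-gain)
        self.end_threshold = 8000             # below this, silence accumulates
        self.end_silence_ms = 375             # silence needed to end an utterance
        self.debounce_frames = 2              # consecutive loud frames to start
        self.min_speech_ms = 200              # shorter segments are discarded
        self.in_speech = False
        self.loud_run = 0
        self.silence_ms = 0
        self.speech_ms = 0

    def feed(self, rms):
        """Feed one frame's RMS; return 'start', 'end', 'drop', or None."""
        if not self.in_speech:
            # Debounce: require consecutive frames above the start threshold
            self.loud_run = self.loud_run + 1 if rms > self.start_threshold else 0
            if self.loud_run >= self.debounce_frames:
                self.in_speech = True
                self.speech_ms = self.loud_run * self.frame_ms
                self.silence_ms = 0
                self.loud_run = 0
                return "start"
            return None
        self.speech_ms += self.frame_ms
        if rms < self.end_threshold:
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.end_silence_ms:
                self.in_speech = False
                # Filter clicks/bumps: demand enough actual speech time
                long_enough = self.speech_ms - self.silence_ms >= self.min_speech_ms
                return "end" if long_enough else "drop"
        else:
            self.silence_ms = 0               # loud frame resets the silence run
        return None
```

Feeding it ten loud frames followed by silence yields a "start" then an "end"; a two-frame blip followed by silence is dropped.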
The Stack
- WhisperHTTP - POSTs PCM→WAV to whisper.cpp server
- OrcaClient - Sends transcription to /api/voice/chat, gets Kaitlin's response
- TTSClient - Chatterbox at 10.1.2.30:8005 for speech synthesis
- MQTT - Publishes events to orcavoice/# for Home Assistant integration
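The WhisperHTTP step is mostly just wrapping raw PCM in a WAV header and uploading it. A sketch, assuming the multipart /inference endpoint that whisper.cpp's bundled server exposes; the host/port, function names, and response handling are placeholders, not OrcaVoice's code.

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=16000):
    """Wrap raw 16-bit mono little-endian PCM in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)          # mono
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
    return buf.getvalue()

def transcribe(pcm_bytes, url="http://localhost:8080/inference"):
    """Hypothetical client call: needs the `requests` package and a running
    whisper.cpp server, so it is illustrative only."""
    import requests
    wav = pcm_to_wav(pcm_bytes)
    resp = requests.post(url, files={"file": ("audio.wav", wav, "audio/wav")})
    return resp.json().get("text", "")
```

Because the server keeps the model resident, each request pays only for inference, which is where the ~90ms figure comes from.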
Code Highlights
VAD processing is dead simple - just RMS calculation:
defp calculate_rms(pcm_data) do
  # Decode the raw binary as 16-bit signed little-endian PCM samples
  samples = for <<sample::signed-little-16 <- pcm_data>>, do: sample
  # Root mean square: sqrt of the mean of the squared sample values
  sum_squares = Enum.reduce(samples, 0, fn s, acc -> acc + s * s end)
  :math.sqrt(sum_squares / length(samples)) |> trunc()
end
Parenthetical sound annotations like (clap) and [BLANK_AUDIO] are filtered out before the text is sent to the LLM.
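That filter can be a one-liner. A sketch in Python, not OrcaVoice's actual (Elixir) code, assuming all parenthesized and bracketed tokens are non-speech annotations:

```python
import re

# Matches "(...)" or "[...]" annotations that Whisper emits for non-speech audio
NOISE_TOKEN = re.compile(r"\([^)]*\)|\[[^\]]*\]")

def clean_transcript(text):
    """Strip sound annotations and collapse leftover whitespace."""
    cleaned = NOISE_TOKEN.sub("", text)
    return " ".join(cleaned.split())
```

A transcript that is nothing but annotations cleans down to an empty string, which is the signal to skip the LLM round-trip entirely.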
Results
End-to-end from speech end to transcription ready: ~100-150ms. The model being pre-loaded in whisper.cpp makes all the difference.
Next: barge-in detection to interrupt Kaitlin mid-sentence when I start talking.