Open Source Speech to Text on Mac: 2026 Comparison (Whisper, Parakeet, Vosk)
Practical guide to running open-source speech-to-text models on macOS — Whisper, Parakeet, Vosk, WhisperKit, MLX. Performance, accuracy, and when to pick which.
The open-source speech-to-text landscape on Mac changed dramatically between 2022 and 2026. Whisper made high-quality on-device transcription a community-driven project. Apple Silicon made the hardware fast enough to run those models in real time. Runtimes like WhisperKit, whisper.cpp, and MLX-Whisper made the models actually usable on a Mac without a Python toolchain.
This guide covers what's available in 2026, where each option fits, and how to choose for your specific use case.
The Three Major Open-Source Speech-to-Text Families
1. Whisper (OpenAI, 2022)
The most widely adopted open-source speech-to-text model. Released by OpenAI in September 2022 with weights and code under the MIT license. Whisper handles 99 languages and is the default choice for most general transcription work.
| Variant | Approx file size | Strength |
|---|---|---|
| Tiny | 75 MB | Fast, low-resource, decent for short English |
| Base | 142 MB | Good for English, fast on any hardware |
| Small | 466 MB | Solid multilingual baseline |
| Medium | 1.5 GB | Strong multilingual, handles accents |
| Large-v3 | 2.9 GB | Best accuracy, slower on older hardware |
| Distil-Whisper | 600 MB+ | Distilled variants for speed/quality trade-offs |
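The trade-offs in the table can be framed as a rough selection rule. A minimal sketch — the function name and thresholds are illustrative, not from any library:

```python
def pick_whisper_variant(ram_gb: float, multilingual: bool, accuracy_first: bool) -> str:
    """Rough Whisper-variant picker mirroring the table above (illustrative thresholds)."""
    if accuracy_first and ram_gb >= 8:
        return "large-v3"               # best accuracy, ~2.9 GB on disk
    if multilingual:
        return "medium" if ram_gb >= 4 else "small"
    return "base" if ram_gb >= 2 else "tiny"

print(pick_whisper_variant(ram_gb=16, multilingual=True, accuracy_first=True))  # -> large-v3
```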
Mac runtimes for Whisper:
- WhisperKit (Argmax) — Swift Package optimized for CoreML + MLX
- whisper.cpp (Georgi Gerganov) — C++ port, runs on any hardware
- mlx-whisper — Apple's MLX framework, Mac-specific
- WhisperX — adds word-level timestamps and speaker diarization (Python)
2. Parakeet (NVIDIA, 2024-2025)
NVIDIA released its Parakeet RNN-T models with open weights starting in 2024. The 2025 multilingual releases brought Parakeet into competition with Whisper for general use, with a meaningful speed advantage.
Strengths:
- Real-time factor often below 0.1× on Apple Silicon — meaning a 60-minute file transcribes in under 6 minutes
- Streaming-first architecture — designed for low-latency dictation, not just batch transcription
- Multilingual coverage has expanded steadily
Trade-offs:
- Slightly less accurate than Whisper Large on adversarial benchmarks
- Smaller community for fine-tunes and specialty vocabulary
- Newer ecosystem — fewer Mac-native runtimes than Whisper
Hapi uses Parakeet specifically for its streaming dictation path because of the latency advantage.
3. Vosk (Alpha Cephei, ongoing)
The most mature pre-Whisper open-source speech-to-text family. Built on Kaldi-derived architecture rather than transformers. Lightweight, runs on tiny hardware, supports many languages with separately-trained models.
When Vosk fits:
- Embedded systems with severe memory constraints
- Offline keyword-spotting and command-recognition use cases
- Languages where Whisper/Parakeet have weak coverage
- Stability-critical pipelines that benefit from a simpler architecture
When it doesn't:
- General-purpose transcription where accuracy matters more than footprint — Whisper/Parakeet are clearly better
Performance on Apple Silicon
Real-world numbers from Mac benchmarks in 2025-2026:
| Model | Runtime | M1 Pro | M3 Max |
|---|---|---|---|
| Whisper Tiny | WhisperKit | ~0.05× | ~0.02× |
| Whisper Base | WhisperKit | ~0.07× | ~0.03× |
| Whisper Medium | WhisperKit | ~0.25× | ~0.10× |
| Whisper Large-v3 | WhisperKit | ~0.6× | ~0.25× |
| Parakeet | NVIDIA NeMo / Hapi-style | ~0.05× | ~0.02× |
| Whisper via PyTorch MPS | PyTorch | 1.0× – 2.0× | 0.5× – 1.0× |
The takeaway: Apple-Silicon-native runtimes are 5-10× faster than naive PyTorch ports. If you're going to run open-source speech-to-text on a Mac, picking the right runtime matters as much as picking the right model.
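Real-time factor is just processing time divided by audio duration, so the table converts directly into wall-clock estimates:

```python
def transcription_minutes(audio_minutes: float, rtf: float) -> float:
    """Estimated wall-clock transcription time given a real-time factor (RTF)."""
    return audio_minutes * rtf

# A 60-minute meeting at Large-v3's ~0.25x RTF on an M3 Max:
print(transcription_minutes(60, 0.25))  # -> 15.0 minutes
# The same meeting at Parakeet's ~0.1x RTF:
print(transcription_minutes(60, 0.1))   # -> 6.0 minutes
```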
How to Run Each Option
Whisper via WhisperKit
```swift
import WhisperKit

let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioPath: "audio.wav")
print(result?.text ?? "")
```
Distribute as a Swift Package; ship as part of any Mac app. This is what most polished Mac transcription apps use.
Whisper via whisper.cpp
```bash
brew install whisper-cpp
whisper-cli -m models/ggml-medium.bin -f audio.wav
```
Cross-platform, no Swift required. Slightly slower than WhisperKit on Mac but works on Linux and Windows too.
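For scripting, it's common to wrap the CLI from Python. A minimal sketch — the `-m`, `-f`, and `-t` flags are real whisper.cpp options, but the helper itself is illustrative:

```python
import subprocess

def whisper_cpp_cmd(model_path: str, audio_path: str, threads: int = 4) -> list[str]:
    # -m: model file, -f: input audio, -t: CPU thread count (whisper.cpp CLI flags)
    return ["whisper-cli", "-m", model_path, "-f", audio_path, "-t", str(threads)]

# subprocess.run(whisper_cpp_cmd("models/ggml-medium.bin", "audio.wav"), check=True)
```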
Parakeet via NeMo
NVIDIA's NeMo toolkit ships Parakeet with PyTorch + ONNX export paths. For Mac use, ONNX with CoreMLExecutionProvider produces good results; raw PyTorch with MPS works but is slower.
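With onnxruntime, execution providers are passed as an ordered preference list. A sketch of the macOS-first ordering described above — the provider names are onnxruntime's real identifiers, but the helper is illustrative:

```python
import sys

def ort_providers(platform: str = sys.platform) -> list[str]:
    """Prefer Apple's CoreML execution provider on macOS; fall back to CPU elsewhere."""
    if platform == "darwin":
        return ["CoreMLExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# session = onnxruntime.InferenceSession("parakeet.onnx", providers=ort_providers())
```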
Vosk via Python
```python
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, 16000)
with wave.open("audio.wav", "rb") as wf:   # expects 16 kHz mono PCM
    while data := wf.readframes(4000):
        rec.AcceptWaveform(data)
print(rec.FinalResult())
```
Lightweight, runs on small models, useful when memory is a hard constraint.
When to Build vs. Use a Packaged App
Direct integration is the right answer when:
- You're shipping a product — own your stack, control updates
- You need custom inference paths — domain-specific fine-tunes, batch optimization
- You're doing research — reproducibility, version control, ablations
- You have unusual privacy or compliance requirements that need explicit chain-of-custody auditing
A packaged Mac app is the right answer when:
- You're an end user who wants a working dictation/transcription tool — UX matters more than tinkering
- You want meeting capture, hotkeys, and formatting — these are real engineering problems beyond the model
- You want updates handled — model improvements, runtime optimizations, OS compatibility
How Hapi Uses Open-Source Models
Hapi bundles open-source models in a polished Mac UX:
- Streaming dictation — Parakeet, for sub-2-second turnaround on voice notes
- Batch / meeting transcription — WhisperKit for high-accuracy multilingual transcription
- Speaker diarization — ECAPA-TDNN embeddings + WeSpeaker clustering
- Language detection — automatic per segment via the multilingual models
- Local LLM — Qwen-class on-device for summarization and chat
All of this runs on the Mac's Neural Engine, GPU, and CPU coordinated via Apple's CoreML and MLX. The user experience is a hotkey and a menu bar; the engineering is a multi-stage pipeline of open-source components.
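The diarization step in that pipeline boils down to grouping per-segment speaker embeddings by similarity. A toy sketch using cosine similarity and a greedy threshold — the model names above are real, but this clustering is a simplification for illustration, not Hapi's implementation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_speakers(embeddings: list[list[float]], threshold: float = 0.8) -> list[int]:
    """Assign each segment embedding to the first existing speaker it resembles."""
    centroids, labels = [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(sims.index(max(sims)))
        else:
            centroids.append(emb)            # unseen voice: start a new speaker
            labels.append(len(centroids) - 1)
    return labels

# Two distinct "voices" in toy 2-D embedding space:
print(greedy_speakers([[1, 0], [0.99, 0.1], [0, 1], [0.1, 0.99]]))  # -> [0, 0, 1, 1]
```

Real systems refine this with proper agglomerative or spectral clustering over the full similarity matrix, but the shape of the problem is the same.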
The Bigger Picture
Open-source speech-to-text on Mac in 2026 is a solved-enough problem that most users no longer need to think about which model is running. Whisper and Parakeet, packaged via WhisperKit or comparable runtimes, deliver cloud-equivalent accuracy at real-time speed without sending audio anywhere.
For developers, this is an unusually rich open-source ecosystem to build on. For end users, it means privacy-respecting transcription is finally a default rather than a compromise.
For more on the underlying runtime, see our What is WhisperKit explainer. For a developer-focused comparison of WhisperX, see our WhisperX vs alternatives guide.