
Open Source Speech to Text on Mac: 2026 Comparison (Whisper, Parakeet, Vosk)

Practical guide to running open-source speech-to-text models on macOS — Whisper, Parakeet, Vosk, WhisperKit, MLX. Performance, accuracy, and when to pick which.


The open-source speech-to-text landscape on Mac changed dramatically between 2022 and 2026. Whisper turned high-quality on-device transcription into a community-driven effort. Apple Silicon made the hardware fast enough to run those models in real time. And runtimes like WhisperKit, whisper.cpp, and MLX-Whisper made the models usable on a Mac without a Python toolchain.

This guide covers what's available in 2026, where each option fits, and how to choose for your specific use case.

The Three Major Open-Source Speech-to-Text Families

1. Whisper (OpenAI, 2022)

The most widely adopted open-source speech-to-text model. Released by OpenAI in October 2022 with weights and code under MIT. Whisper handles 99 languages and is the default choice for most general transcription work.

| Variant | Approx. file size | Strength |
| --- | --- | --- |
| Tiny | 75 MB | Fast, low-resource, decent for short English |
| Base | 142 MB | Good for English, fast on any hardware |
| Small | 466 MB | Solid multilingual baseline |
| Medium | 1.5 GB | Strong multilingual, handles accents |
| Large-v3 | 2.9 GB | Best accuracy, slower on older hardware |
| Distil-Whisper | 600 MB+ | Distilled variants for speed/quality trade-offs |

Mac runtimes for Whisper:

  • WhisperKit (Argmax) — Swift Package optimized for CoreML + MLX
  • whisper.cpp (Georgi Gerganov) — C++ port, runs on any hardware
  • mlx-whisper — Apple's MLX framework, Mac-specific
  • whisperX — Adds word-level timestamps and diarization (Python)

2. Parakeet (NVIDIA, 2024-2025)

NVIDIA released its Parakeet RNN-T models with open weights starting in 2024. The 2025 multilingual releases brought Parakeet into competition with Whisper for general use, with a meaningful speed advantage.

Strengths:

  • Real-time factor often below 0.1× on Apple Silicon — meaning a 60-minute file transcribes in under 6 minutes
  • Streaming-first architecture — designed for low-latency dictation, not just batch transcription
  • Multilingual coverage has expanded steadily
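The real-time factor (RTF) numbers above translate directly into wall-clock time: transcription time ≈ audio duration × RTF. A quick check of the 60-minute claim:

```python
def transcription_minutes(audio_minutes: float, rtf: float) -> float:
    """Wall-clock transcription time given a real-time factor (lower = faster)."""
    return audio_minutes * rtf

# At RTF 0.1, a 60-minute recording transcribes in about 6 minutes.
print(transcription_minutes(60, 0.1))  # 6.0
```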

Trade-offs:

  • Slightly less accurate than Whisper Large on adversarial benchmarks
  • Smaller community for fine-tunes and specialty vocabulary
  • Newer ecosystem — fewer Mac-native runtimes than Whisper

Hapi uses Parakeet specifically for its streaming dictation path because of the latency advantage.

3. Vosk (Alpha Cephei, ongoing)

The most mature pre-Whisper open-source speech-to-text family. Built on Kaldi-derived architecture rather than transformers. Lightweight, runs on tiny hardware, supports many languages with separately-trained models.

When Vosk fits:

  • Embedded systems with severe memory constraints
  • Offline keyword-spotting and command-recognition use cases
  • Languages where Whisper/Parakeet have weak coverage
  • Stability-critical pipelines that benefit from a simpler architecture

When it doesn't:

  • General-purpose transcription where accuracy matters more than footprint — Whisper/Parakeet are clearly better

Performance on Apple Silicon

Approximate real-time factors (RTF = processing time ÷ audio duration; lower is faster) from Mac benchmarks in 2025-2026:

| Model | Runtime | M1 Pro | M3 Max |
| --- | --- | --- | --- |
| Whisper Tiny | WhisperKit | ~0.05× | ~0.02× |
| Whisper Base | WhisperKit | ~0.07× | ~0.03× |
| Whisper Medium | WhisperKit | ~0.25× | ~0.10× |
| Whisper Large-v3 | WhisperKit | ~0.6× | ~0.25× |
| Parakeet | NVIDIA NeMo / Hapi-style | ~0.05× | ~0.02× |
| Whisper via PyTorch MPS | PyTorch | 1.0× – 2.0× | 0.5× – 1.0× |

The takeaway: Apple-Silicon-native runtimes are 5-10× faster than naive PyTorch ports. If you're going to run open-source speech-to-text on a Mac, picking the right runtime matters as much as picking the right model.

How to Run Each Option

Whisper via WhisperKit

import WhisperKit

// Default init loads a default model (downloading it on first run).
let pipe = try await WhisperKit()
let result = try await pipe.transcribe(audioPath: "audio.wav")
print(result?.text ?? "")

Distributed as a Swift package, so it can ship inside any Mac app. This is what most polished Mac transcription apps use.

Whisper via whisper.cpp

brew install whisper-cpp
whisper-cli -m models/ggml-medium.bin -f audio.wav

Cross-platform, no Swift required. Slightly slower than WhisperKit on Mac but works on Linux and Windows too.

Parakeet via NeMo

NVIDIA's NeMo toolkit ships Parakeet with PyTorch + ONNX export paths. For Mac use, ONNX with CoreMLExecutionProvider produces good results; raw PyTorch with MPS works but is slower.
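A minimal sketch of the provider-selection side of that ONNX path. The model file name below is hypothetical; `CoreMLExecutionProvider` and `CPUExecutionProvider` are ONNX Runtime's actual provider identifiers:

```python
import platform

def pick_providers() -> list[str]:
    """Ordered ONNX Runtime provider list: prefer CoreML on macOS, fall back to CPU."""
    if platform.system() == "Darwin":
        return ["CoreMLExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# Usage sketch (requires onnxruntime and an exported Parakeet encoder):
# import onnxruntime as ort
# sess = ort.InferenceSession("parakeet_encoder.onnx", providers=pick_providers())
```

ONNX Runtime tries providers in order, so listing CPU last gives a safe fallback when the CoreML provider can't handle a given graph.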

Vosk via Python

from vosk import Model, KaldiRecognizer
import wave, json
model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, 16000)
wf = wave.open("audio.wav", "rb")  # expects 16 kHz mono PCM
while data := wf.readframes(4000):
    rec.AcceptWaveform(data)
print(json.loads(rec.FinalResult())["text"])

Lightweight, runs on small models, useful when memory is a hard constraint.

When to Build vs. Use a Packaged App

Direct integration is the right answer when:

  • You're shipping a product — own your stack, control updates
  • You need custom inference paths — domain-specific fine-tunes, batch optimization
  • You're doing research — reproducibility, version control, ablations
  • You have unusual privacy or compliance requirements that need explicit chain-of-custody auditing

A packaged Mac app is the right answer when:

  • You're an end user who wants a working dictation/transcription tool — UX matters more than tinkering
  • You want meeting capture, hotkeys, and formatting — these are real engineering problems beyond the model
  • You want updates handled — model improvements, runtime optimizations, OS compatibility

How Hapi Uses Open-Source Models

Hapi bundles open-source models in a polished Mac UX:

  • Streaming dictation — Parakeet, for voice notes transcribed in under two seconds
  • Batch / meeting transcription — WhisperKit for high-accuracy multilingual transcription
  • Speaker diarization — ECAPA-TDNN embeddings + WeSpeaker clustering
  • Language detection — automatic per segment via the multilingual models
  • Local LLM — Qwen-class on-device for summarization and chat

All of this runs on the Mac's Neural Engine, GPU, and CPU coordinated via Apple's CoreML and MLX. The user experience is a hotkey and a menu bar; the engineering is a multi-stage pipeline of open-source components.
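Conceptually, a pipeline like that composes independent stages over the same audio. A toy sketch of the pattern (stage names and logic are illustrative, not Hapi's actual internals):

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    text: str = ""
    speakers: list[str] = field(default_factory=list)
    summary: str = ""

def run_pipeline(audio: bytes, stages) -> Result:
    """Each stage receives the raw audio plus the accumulated result."""
    result = Result()
    for stage in stages:
        result = stage(audio, result)
    return result

# Toy stand-ins for the ASR / diarization / summarization stages.
def asr(audio, r):       r.text = "hello world"; return r
def diarize(audio, r):   r.speakers = ["S1"]; return r
def summarize(audio, r): r.summary = r.text[:5]; return r

out = run_pipeline(b"raw-audio", [asr, diarize, summarize])
print(out.summary)  # hello
```

The point of the pattern: each stage can run on whichever compute unit suits it (Neural Engine, GPU, CPU) as long as the interface between stages stays fixed.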

The Bigger Picture

Open-source speech-to-text on Mac in 2026 is a solved-enough problem that most users no longer need to think about which model is running. Whisper and Parakeet, packaged via WhisperKit or comparable runtimes, deliver cloud-equivalent accuracy at real-time speed without sending audio anywhere.

For developers, this is an unusually rich open-source ecosystem to build on. For end users, it means privacy-respecting transcription is finally a default rather than a compromise.

For more on the underlying runtime, see our What is WhisperKit explainer. For a developer-focused comparison of WhisperX, see our WhisperX vs alternatives guide.
