
What Is WhisperKit? Apple Silicon Whisper Inference Explained (2026)

WhisperKit is Argmax's open-source Swift framework for running Whisper speech-to-text on Apple Silicon. Here's what it is, how it works, and why Mac apps use it.


WhisperKit is one of the more important pieces of infrastructure in the modern Mac speech-to-text ecosystem. If you've used a Mac transcription or dictation app that runs Whisper locally — without sending audio to the cloud — there's a good chance WhisperKit is doing the work under the hood. This guide explains what it is, how it works, and what the practical implications are for users and developers.

The Problem WhisperKit Solves

OpenAI released Whisper in 2022 as an open-weights speech-to-text model. The reference implementation is in Python and PyTorch, which is fine for cloud servers but slow and memory-heavy on a typical Mac. Several community efforts ported Whisper to faster runtimes (whisper.cpp and its ggml backend, MLX for Apple Silicon, ONNX exports). Each addressed part of the problem.

WhisperKit, from Argmax, takes the specific bet that Apple Silicon deserves a first-class native runtime. It compiles Whisper into a combination of Apple's CoreML format (which can target the Apple Neural Engine, a fixed-function ML accelerator) and the MLX framework (which targets the GPU through unified memory). The result is a Swift Package that any Mac or iOS app can import.
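
For developers, the integration surface is small. Below is a minimal sketch of what pulling WhisperKit into a Swift Package looks like; the target name, platform minimums, and version pin are placeholders rather than recommendations, so check the Argmax repository for current requirements.

```swift
// swift-tools-version:5.9
// Package.swift: minimal sketch of adding WhisperKit as a dependency.
// "MyTranscriptionApp" is a hypothetical target name; the version pin and
// platform minimums are placeholders, not verified requirements.
import PackageDescription

let package = Package(
    name: "MyTranscriptionApp",
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/argmaxinc/WhisperKit.git", from: "0.9.0")
    ],
    targets: [
        .executableTarget(
            name: "MyTranscriptionApp",
            dependencies: [.product(name: "WhisperKit", package: "WhisperKit")]
        )
    ]
)
```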

What WhisperKit Does Internally

When an app uses WhisperKit, the runtime path roughly looks like this:

| Stage | Hardware target | What happens |
|---|---|---|
| Audio preprocessing | CPU | 16 kHz mono PCM, 30-second windows, log-Mel spectrograms |
| Encoder | Neural Engine (CoreML) | Whisper's transformer encoder on the Mel spectrogram |
| Decoder | Neural Engine + GPU | Autoregressive token generation with KV cache |
| Postprocessing | CPU | Token-to-text decoding, timestamp alignment, language ID |

The crucial design choice is routing different parts of inference to different Apple Silicon units. The encoder is heavily matrix-multiplication-bound and matches the Neural Engine's strengths. The decoder benefits from GPU parallelism for batch operations. CPU handles the orchestration. This split is what makes WhisperKit faster than naive ports.
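
WhisperKit manages this routing internally, but the mechanism underneath is CoreML's compute-unit selection. Here is a minimal sketch of the idea using Apple's MLModelConfiguration API; the compiled model file names are hypothetical stand-ins, not the actual artifacts WhisperKit ships.

```swift
import CoreML

// Sketch: pin different compiled CoreML models to different compute units.
// "WhisperEncoder.mlmodelc" and "WhisperDecoder.mlmodelc" are hypothetical names.
func loadWhisperModels() throws -> (encoder: MLModel, decoder: MLModel) {
    // Encoder: matrix-multiply heavy, well suited to the Neural Engine.
    let encoderConfig = MLModelConfiguration()
    encoderConfig.computeUnits = .cpuAndNeuralEngine

    // Decoder: autoregressive generation, allowed to fall back to the GPU.
    let decoderConfig = MLModelConfiguration()
    decoderConfig.computeUnits = .all

    let encoder = try MLModel(
        contentsOf: URL(fileURLWithPath: "WhisperEncoder.mlmodelc"),
        configuration: encoderConfig
    )
    let decoder = try MLModel(
        contentsOf: URL(fileURLWithPath: "WhisperDecoder.mlmodelc"),
        configuration: decoderConfig
    )
    return (encoder, decoder)
}
```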

Performance Profile

On Apple Silicon Macs, WhisperKit hits real-time factors well below 1.0 for most model sizes, meaning transcription takes only a fraction of the audio's own duration: at a real-time factor of 0.1, a 60-minute file finishes in roughly 6 minutes. Specific numbers from public benchmarks:

| Model | Approx. file size | Real-time factor on M-series |
|---|---|---|
| Tiny | 75 MB | below 0.05× |
| Base | 142 MB | below 0.1× |
| Small | 466 MB | below 0.2× |
| Medium | 1.5 GB | below 0.5× |
| Large-v3 | 2.9 GB | below 1.0× (varies) |

For comparison, naive PyTorch Whisper without GPU is often 2-5× slower than real-time on the same hardware. The gap is real.
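
Real-time factor is just processing time divided by audio duration, so the table translates directly into wall-clock estimates. A trivial helper makes the arithmetic concrete; the sample values below are illustrative, not benchmark results.

```swift
import Foundation

// Real-time factor (RTF) = time spent transcribing / duration of the audio.
// An RTF below 1.0 means transcription finishes faster than playback would.
func estimatedTranscriptionTime(audioSeconds: Double, realTimeFactor: Double) -> Double {
    audioSeconds * realTimeFactor
}

// Illustrative example: a 60-minute recording at RTF 0.1 (roughly the Base row above)
// needs about 6 minutes of processing.
let seconds = estimatedTranscriptionTime(audioSeconds: 60 * 60, realTimeFactor: 0.1)
print("≈ \(Int(seconds / 60)) minutes")   // prints "≈ 6 minutes"
```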

Why Apps Choose WhisperKit

A Mac app that wants on-device speech-to-text faces a build-vs-buy decision:

| Option | Pros | Cons |
|---|---|---|
| Plain whisper.cpp | Cross-platform, mature | Slower on Apple Silicon than ANE-targeted runtimes |
| Roll your own CoreML conversion | Full control | Months of engineering, model-quality risk |
| WhisperKit | Optimized for Apple Silicon, Swift Package, MIT licensed | Apple platforms only |
| Other engines (Parakeet, etc.) | Different accuracy/speed trade-offs | Different model family |

For most Mac apps, WhisperKit is the right answer. Hapi uses it as one of its engine paths, alongside Parakeet for streaming dictation.
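
To give a sense of what that integration looks like in practice, basic batch transcription with WhisperKit is roughly the snippet below. The initializer options and result type vary between WhisperKit releases, so treat this as a sketch and verify it against the version you pin; the file path is a placeholder.

```swift
import WhisperKit

// Sketch of a basic WhisperKit transcription call. Signatures follow the
// project's documented usage but may differ between releases; verify against
// the release you depend on. "meeting.m4a" is a placeholder path.
func transcribeMeeting() async throws {
    let pipe = try await WhisperKit()                        // loads a default model on first use
    let results = try await pipe.transcribe(audioPath: "meeting.m4a")
    print(results)                                           // transcription segments with text and timestamps
}
```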

What This Means for End Users

If you're a Mac user evaluating speech-to-text apps, "powered by WhisperKit" tells you several things:

  • Audio stays on your Mac. WhisperKit is designed for on-device inference. An app using WhisperKit is architecturally a local-first product, not a cloud SaaS in disguise.
  • Quality is consistent across apps. Different apps using WhisperKit get the same underlying accuracy from the same Whisper model. Differentiation happens in the UX layer — formatting, hotkeys, meeting capture, language detection — not the raw transcription.
  • Hardware matters. WhisperKit is fastest on M2/M3/M4 with their improved Neural Engines. Older M1 chips work but produce noticeably longer transcription times for large models. Intel Macs are not supported.

How Hapi Uses WhisperKit

Hapi's transcription engine is a hybrid:

  • Streaming dictation uses Parakeet-class models for the lowest possible latency on short voice notes (~2 seconds end-to-end on Apple Silicon).
  • Batch / meeting transcription uses WhisperKit-class models for the highest accuracy on longer recordings, particularly for multilingual content and challenging audio conditions.
  • Diarization runs ECAPA-based speaker embeddings independently of WhisperKit's transcription path.
  • Language detection is automatic per segment, leveraging the multilingual capability of the underlying models.

The user experience hides all of this: you press a hotkey or join a meeting and the right engine runs. The architecture still matters, though, because it is the reason every byte of audio stays on the Mac.
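
As a rough illustration of that routing, here is a hypothetical sketch of the idea; the names and structure are illustrative only, not Hapi's actual implementation.

```swift
import Foundation

// Hypothetical sketch of engine selection in a hybrid local transcription stack.
enum TranscriptionTask {
    case streamingDictation            // short, latency-sensitive voice input
    case batchRecording(fileURL: URL)  // long recordings, accuracy-sensitive
}

enum LocalEngine {
    case parakeet     // streaming-friendly, lowest latency
    case whisperKit   // batch, strongest multilingual accuracy
}

func selectEngine(for task: TranscriptionTask) -> LocalEngine {
    switch task {
    case .streamingDictation:
        return .parakeet
    case .batchRecording:
        return .whisperKit
    }
}
```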

The Bigger Picture

WhisperKit is one piece of a broader shift. Five years ago, "speech-to-text on a Mac" meant uploading audio to a vendor's cloud. Today, the combination of capable open-weights models (Whisper, Parakeet), Apple Silicon's Neural Engine, and runtimes like WhisperKit means that for most use cases, the cloud is no longer required.

For sensitive content — healthcare, legal, journalism, regulated industries — that shift is more than a privacy nice-to-have. It's the difference between an architecturally compliant tool and a non-starter.

Bottom Line

WhisperKit is the de facto standard for running Whisper-family speech-to-text models on Apple Silicon. End users encounter it through the apps that use it; developers integrate it as a Swift Package; the practical effect is that on-device transcription on a Mac is now a competently solved problem.

For a deeper dive into why this matters for users, see our local speech-to-text guide.
