What Is WhisperKit? Apple Silicon Whisper Inference Explained (2026)
WhisperKit is Argmax's open-source Swift framework for running Whisper speech-to-text on Apple Silicon. Here's what it is, how it works, and why Mac apps use it.
WhisperKit is one of the more important pieces of infrastructure in the modern Mac speech-to-text ecosystem. If you've used a Mac transcription or dictation app that runs Whisper locally — without sending audio to the cloud — there's a good chance WhisperKit is doing the work under the hood. This guide explains what it is, how it works, and what the practical implications are for users and developers.
The Problem WhisperKit Solves
OpenAI released Whisper in 2022 as an open-weights speech-to-text model. The reference implementation is in Python and PyTorch: fine for cloud servers, but slow and memory-heavy on a typical Mac. Several community efforts ported Whisper to faster runtimes, notably whisper.cpp (built on the ggml C library), MLX for Apple Silicon, and ONNX-based ports. Each addressed pieces of the problem.
WhisperKit, from Argmax, takes the specific bet that Apple Silicon deserves a first-class native runtime. It compiles Whisper to a combination of Apple's Core ML (which targets the Neural Engine, the ANE) and the MLX framework (which targets the GPU through unified memory). The result is a Swift Package that any Mac or iOS app can import.
What WhisperKit Does Internally
When an app uses WhisperKit, the runtime path roughly looks like this:
| Stage | Hardware target | What happens |
|---|---|---|
| Audio preprocessing | CPU | 16 kHz mono PCM, 30-second windows, log-Mel spectrograms |
| Encoder | Neural Engine (CoreML) | Whisper's transformer encoder on the Mel spectrogram |
| Decoder | Neural Engine + GPU | Autoregressive token generation with KV cache |
| Postprocessing | CPU | Token-to-text decoding, timestamp alignment, language ID |
The crucial design choice is routing different parts of inference to different Apple Silicon units. The encoder is heavily matrix-multiplication-bound and matches the Neural Engine's strengths. The decoder benefits from GPU parallelism for batch operations. CPU handles the orchestration. This split is what makes WhisperKit faster than naive ports.
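The preprocessing row above can be made concrete. The sketch below computes a Whisper-style log-Mel spectrogram in plain NumPy, using Whisper's canonical parameters (16 kHz audio, a 400-sample FFT window, a 160-sample hop, 80 mel bins). This is an illustration of the stage, not WhisperKit's actual code; the real pipeline is Swift and uses Whisper's exact filterbank and normalization, which this simplified version only approximates:

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono PCM
N_FFT = 400            # 25 ms analysis window
HOP = 160              # 10 ms hop -> 100 frames per second
N_MELS = 80            # Whisper's mel-bin count (large-v3 uses 128)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular mel filters mapping FFT bins to mel bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio):
    """Windowed STFT magnitude -> mel projection -> log compression."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack(
        [audio[i * HOP : i * HOP + N_FFT] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank() @ power.T
    return np.log10(np.maximum(mel, 1e-10))

# One 30-second window of quiet noise: output is (80 mel bins, ~3000 frames)
audio = (np.random.randn(SAMPLE_RATE * 30) * 0.01).astype(np.float32)
spec = log_mel_spectrogram(audio)
```

The encoder then consumes this fixed-size spectrogram, which is what makes the encoder stage such a good fit for the Neural Engine's static-shape matrix units.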
Performance Profile
On Apple Silicon Macs, WhisperKit hits real-time factors well below 1.0, meaning it transcribes audio much faster than it plays back: at a 0.1× real-time factor, a 60-minute file takes about six minutes. Specific numbers from public benchmarks:
| Model | Approx file size | Real-time factor on M-series |
|---|---|---|
| Tiny | 75 MB | below 0.05× |
| Base | 142 MB | below 0.1× |
| Small | 466 MB | below 0.2× |
| Medium | 1.5 GB | below 0.5× |
| Large-v3 | 2.9 GB | below 1.0× (varies) |
For comparison, naive PyTorch Whisper without GPU is often 2-5× slower than real-time on the same hardware. The gap is real.
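The real-time factor itself is just the ratio of processing time to audio duration. A quick sanity check (illustrative arithmetic, not a benchmark):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than playback."""
    return processing_seconds / audio_seconds

# A 60-minute file transcribed in 3 minutes is a 0.05x real-time factor,
# in the ballpark of the Tiny row in the table above.
rtf = real_time_factor(processing_seconds=180.0, audio_seconds=3600.0)

def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the ratio to predict wall-clock time for a given file."""
    return audio_seconds * rtf
```

By the same arithmetic, a 2-5× RTF for naive PyTorch means a one-hour file can take two to five hours.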
Why Apps Choose WhisperKit
A Mac app that wants on-device speech-to-text faces a build-vs-buy decision:
| Option | Pros | Cons |
|---|---|---|
| whisper.cpp | Cross-platform, mature | Slower on Apple Silicon than ANE-targeted runtimes |
| Roll your own CoreML conversion | Full control | Months of engineering, model-quality risk |
| WhisperKit | Optimized for Apple Silicon, Swift Package, MIT licensed | Apple platforms only |
| Other engines (Parakeet, etc.) | Different accuracy/speed trade-offs | Different model family |
For most Mac apps, WhisperKit is the right answer. Hapi uses it as one of its engine paths, alongside Parakeet for streaming dictation.
What This Means for End Users
If you're a Mac user evaluating speech-to-text apps, "powered by WhisperKit" tells you several things:
- Audio stays on your Mac. WhisperKit is designed for on-device inference. An app using WhisperKit is architecturally a local-first product, not a cloud SaaS in disguise.
- Quality is consistent across apps. Different apps using WhisperKit get the same underlying accuracy from the same Whisper model. Differentiation happens in the UX layer — formatting, hotkeys, meeting capture, language detection — not the raw transcription.
- Hardware matters. WhisperKit is fastest on M2/M3/M4 chips with their improved Neural Engines. Older M1 chips work, but transcription with the larger models takes noticeably longer. Intel Macs are not supported.
How Hapi Uses WhisperKit
Hapi's transcription engine is a hybrid:
- Streaming dictation uses Parakeet-class models for the lowest possible latency on short voice notes (~2 seconds end-to-end on Apple Silicon).
- Batch / meeting transcription uses WhisperKit-class models for the highest accuracy on longer recordings, particularly for multilingual content and challenging audio conditions.
- Diarization runs ECAPA-based speaker embeddings independently of WhisperKit's transcription path.
- Language detection is automatic per segment, leveraging the multilingual capability of the underlying models.
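A toy dispatcher illustrates this routing. Everything here is invented for illustration (the names `AudioJob` and `pick_engine`, and the 60-second threshold); it mirrors the hybrid design described above, not Hapi's actual code:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Engine(Enum):
    PARAKEET = auto()    # streaming model: lowest latency
    WHISPERKIT = auto()  # Whisper via WhisperKit: highest accuracy

@dataclass
class AudioJob:
    duration_seconds: float
    is_live_dictation: bool

def pick_engine(job: AudioJob) -> Engine:
    # Live, short input: latency dominates, so route to the streaming engine.
    if job.is_live_dictation and job.duration_seconds < 60:
        return Engine.PARAKEET
    # Recorded or long-form audio: accuracy dominates.
    return Engine.WHISPERKIT

# A one-hour meeting recording goes to the batch engine.
engine = pick_engine(AudioJob(duration_seconds=3600, is_live_dictation=False))
```

The point of the split is that neither engine has to compromise: the streaming path never waits on a large model, and the batch path never trades accuracy for latency.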
The user experience hides all of this: you press a hotkey or join a meeting and the right engine runs. The architecture still matters, though, because it is the reason every byte of audio stays on the Mac.
The Bigger Picture
WhisperKit is one piece of a broader shift. Five years ago, "speech-to-text on a Mac" meant uploading audio to a vendor's cloud. Today, the combination of capable open-weights models (Whisper, Parakeet), Apple Silicon's Neural Engine, and runtimes like WhisperKit means that for most use cases, the cloud is no longer required.
For sensitive content — healthcare, legal, journalism, regulated industries — that shift is more than a privacy nice-to-have. It's the difference between an architecturally compliant tool and a non-starter.
Bottom Line
WhisperKit is the de facto standard for running Whisper-family speech-to-text models on Apple Silicon. End users encounter it through the apps that use it; developers integrate it as a Swift Package; the practical effect is that on-device transcription on a Mac is now a competently solved problem.
For a deeper dive into why this matters for users, see our local speech-to-text guide.