
Transcribe Audio

Transcribe any audio or video file entirely in your browser — no uploads, no accounts.

On-device audio transcription — free, private, no upload

This tool converts speech to text using Whisper, an open-source automatic speech recognition model developed by OpenAI, running entirely inside your browser via WebAssembly and optionally WebGPU. No audio is transmitted to any server at any point. The model is downloaded once from the Hugging Face Hub CDN and cached in your browser — after the initial download, transcription works with no internet connection.

Three model sizes are available. The English Fast model is the right choice for most English recordings — meetings, interviews, lectures, and podcasts — and is the smallest download at around 39 MB. The Multilingual model adds support for non-English languages at around 73 MB. The High Accuracy model provides better results on technically complex content, overlapping speech, or heavy accents at around 237 MB.

Output is available in three formats: plain text for direct reading and editing, SRT for video subtitle tracks, and VTT for web video captions (HTML5 <track> elements). All three formats include word-level timestamps from Whisper's output. A PDF notes export is also available, with optional timestamps per paragraph, formatted for reading and printing.
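As an illustration of how the two subtitle formats differ, here is a minimal sketch that turns timestamped segments into SRT and VTT text. The Segment shape and the function names are assumptions made for this example, not this tool's internal code. The format details are standard: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a WEBVTT header and uses a period.

```typescript
// A transcript segment with start/end times in seconds.
// The field names are assumptions for this sketch, not the tool's schema.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) {
    s = "0" + s;
  }
  return s;
}

// Format seconds as HH:MM:SS<separator>mmm.
// SRT uses "," before the milliseconds, WebVTT uses ".".
function formatTimestamp(seconds: number, separator: string): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + separator + pad(ms, 3);
}

// Build SRT: numbered cues separated by a blank line.
function toSrt(segments: Segment[]): string {
  return segments
    .map(function (seg, i) {
      return (
        String(i + 1) + "\n" +
        formatTimestamp(seg.start, ",") + " --> " + formatTimestamp(seg.end, ",") + "\n" +
        seg.text.trim() + "\n"
      );
    })
    .join("\n");
}

// Build WebVTT: a "WEBVTT" header, a blank line, then unnumbered cues.
function toVtt(segments: Segment[]): string {
  const cues = segments.map(function (seg) {
    return (
      formatTimestamp(seg.start, ".") + " --> " + formatTimestamp(seg.end, ".") + "\n" +
      seg.text.trim() + "\n"
    );
  });
  return "WEBVTT\n\n" + cues.join("\n");
}
```

Each segment here becomes one cue. Word-level timing could be handled the same way by passing one segment per word, though that usually makes subtitles harder to read.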

Frequently asked questions

Does my audio get uploaded to a server?

No. Transcription runs entirely in your browser using Whisper, an open-source speech recognition model from OpenAI. The audio never leaves your device. The Whisper model files download once from Hugging Face and are then cached locally — subsequent uses are fully offline.

How large is the model download?

The English Fast model (whisper-tiny.en) is a download of approximately 39 MB on first use. The Multilingual model (whisper-base) is approximately 73 MB. The High Accuracy model (whisper-small) is approximately 237 MB. All sizes are for the default quantization (q8). After the first download, the model is cached in your browser and is not downloaded again.

Which model should I use for English speech?

English Fast (whisper-tiny.en) is the right starting point for most English content: meetings, interviews, lectures, and podcasts. It is the smallest model and runs fastest. Switch to High Accuracy (whisper-small) if you need better results on technical jargon, heavy accents, or noisy audio. The Multilingual model (whisper-base) is the one to choose for non-English languages.
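The guidance above can be written as a small lookup. This is a sketch only: the key names and the chooseModel function are hypothetical, while the model names and approximate sizes are the ones quoted on this page. It simplifies by always returning the Multilingual model for non-English audio, since that is the only recommendation given for that case.

```typescript
// Hypothetical keys for the three options described on this page.
type ModelKey = "english-fast" | "multilingual" | "high-accuracy";

interface ModelInfo {
  name: string;         // model name as quoted in the text
  approxSizeMb: number; // approximate q8 download size quoted in the text
}

const MODELS: Record<ModelKey, ModelInfo> = {
  "english-fast": { name: "whisper-tiny.en", approxSizeMb: 39 },
  "multilingual": { name: "whisper-base", approxSizeMb: 73 },
  "high-accuracy": { name: "whisper-small", approxSizeMb: 237 }
};

// Pick a model from two yes/no questions.
// Assumption: non-English audio always maps to the Multilingual model,
// because the page gives no other recommendation for it.
function chooseModel(isEnglish: boolean, needsHighAccuracy: boolean): ModelKey {
  if (!isEnglish) {
    return "multilingual";
  }
  return needsHighAccuracy ? "high-accuracy" : "english-fast";
}
```

For example, chooseModel(true, false) selects the 39 MB English Fast model, and chooseModel(true, true) selects the 237 MB High Accuracy model.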

How fast is transcription?

Speed depends mainly on your hardware, the model you pick, and which backend the browser uses. On devices with a compatible GPU, the WebGPU backend is used and transcription is typically much faster than on CPU. Without WebGPU, the WebAssembly backend runs on CPU. As a rough frame of reference: a device running in CPU mode might process a 10-minute file in anywhere from 2 to 15 minutes depending on the processor. These figures are illustrative, and no specific numbers are guaranteed.
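To get a feel for other file lengths, the rough CPU-mode range above can be scaled. This sketch assumes processing time grows linearly with audio length, which is an assumption of the example rather than a measured property of this tool. The function and constant names are hypothetical.

```typescript
interface TimeRange {
  minMinutes: number;
  maxMinutes: number;
}

// Rough CPU-mode reference from the text: a 10-minute file
// takes somewhere between 2 and 15 minutes.
const REFERENCE_AUDIO_MINUTES = 10;
const REFERENCE_MIN_MINUTES = 2;
const REFERENCE_MAX_MINUTES = 15;

// Scale the reference range to another audio length.
// Assumption: processing time is proportional to audio duration.
function estimateCpuProcessingMinutes(audioMinutes: number): TimeRange {
  if (!isFinite(audioMinutes) || audioMinutes < 0) {
    throw new Error("audioMinutes must be a non-negative finite number");
  }
  const scale = audioMinutes / REFERENCE_AUDIO_MINUTES;
  return {
    minMinutes: REFERENCE_MIN_MINUTES * scale,
    maxMinutes: REFERENCE_MAX_MINUTES * scale
  };
}
```

Under that assumption, a 60-minute recording in CPU mode would land somewhere between 12 and 90 minutes. Treat the result as an order-of-magnitude guide only.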

What is the practical limit on file length?

There is no hard limit imposed by this tool. In practice, longer files require more RAM to hold the decoded PCM in memory and more processing time. Files up to about 2 hours are generally manageable on modern devices. For very long recordings, consider extracting the audio track first (use the Extract Audio tool) to reduce the file size before transcribing.
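The memory cost of decoded audio can be estimated with simple arithmetic. Whisper takes 16 kHz mono input. This sketch additionally assumes the samples are held as 32-bit floats, which is common for browser audio but is not confirmed for this tool. The names are hypothetical.

```typescript
// Whisper models take 16 kHz mono audio as input.
const SAMPLE_RATE_HZ = 16000;
// Assumption: samples are stored as 32-bit floats (4 bytes each).
const BYTES_PER_SAMPLE = 4;

// Estimate the bytes needed to hold the decoded mono PCM in memory.
function estimatePcmBytes(durationSeconds: number): number {
  if (!isFinite(durationSeconds) || durationSeconds < 0) {
    throw new Error("durationSeconds must be a non-negative finite number");
  }
  return Math.ceil(durationSeconds * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE;
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
function bytesToMiB(bytes: number): number {
  return bytes / (1024 * 1024);
}
```

For a 2-hour file (7200 seconds) this gives 460,800,000 bytes, or about 439 MiB, for the sample buffer alone. Peak usage may be higher if the browser first decodes at a higher sample rate or with multiple channels.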
