Microsoft MAI-Transcribe-1 and MAI-Voice-1: Fast, Cheap, In-House
Microsoft's first in-house speech models: MAI-Transcribe-1 beats Whisper on all 25 benchmarked languages at roughly half the price of typical enterprise transcription APIs, and MAI-Voice-1 generates 60 seconds of expressive audio in under one second.
Microsoft has been building on everyone else’s models for years — OpenAI’s GPT series, Meta’s Llama, Mistral’s releases. On April 2, the company shipped something different: two foundation models built entirely in-house, and they’re both about voice.
MAI-Transcribe-1 handles speech-to-text. MAI-Voice-1 handles text-to-speech. Together with the previously announced MAI-Image-2, they form Microsoft’s first real foundation model family — not a wrapper around someone else’s technology, but models trained from scratch on Microsoft’s own compute.
MAI-Transcribe-1: The Whisper Killer
The headline number: 3.8% average word error rate across 25 languages on the FLEURS benchmark. That beats OpenAI’s Whisper-large-v3 on every single one of those 25 languages. Batch transcription runs 2.5x faster than Microsoft’s current Azure Fast offering.
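Word error rate, the metric behind that 3.8% figure, is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal implementation of the standard calculation (not Microsoft's evaluation code) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

A 3.8% average means roughly one error per 26 reference words, averaged across the 25 FLEURS languages.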
What makes this more than a benchmark flex is the engineering for messy audio. MAI-Transcribe-1 was specifically trained on background noise, low-quality recordings, and overlapping speakers — the kind of audio that makes enterprise transcription painful. It accepts MP3, WAV, and FLAC files up to 200MB. Diarization, contextual biasing, and streaming are listed as “coming soon,” which means the model ships today as a batch tool with real-time capabilities following.
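The format and size limits are easy to enforce client-side before submitting a batch job. A small sketch — the MP3/WAV/FLAC and 200MB constraints come from the announcement, but the helper itself is illustrative, not part of any Microsoft SDK:

```python
import os

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".flac"}  # formats listed for MAI-Transcribe-1
MAX_BYTES = 200 * 1024 * 1024                    # 200MB file cap

def validate_audio(path: str) -> None:
    """Raise ValueError if the file would be rejected by the batch endpoint."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds 200MB limit")
```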
Pricing sits at $0.36 per hour of audio. That’s aggressive — roughly half what most enterprise transcription APIs charge.
MAI-Voice-1: One Second to Sixty
MAI-Voice-1 generates 60 seconds of expressive speech in under one second on a single GPU. Custom voice creation works from just a few seconds of reference audio through Microsoft Foundry — you feed it a short clip, it clones the voice, and it preserves speaker identity across long-form content.
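The cloning workflow described above reduces to: upload a few seconds of reference audio, get back a custom voice, then synthesize long-form content with it. Only that shape comes from the announcement — the payload fields and endpoint below are hypothetical placeholders, not documented Foundry API:

```python
import base64

def build_clone_request(voice_name: str, reference_audio: bytes) -> dict:
    """Assemble a voice-clone payload: a short reference clip plus a name
    for the resulting custom voice. All field names here are hypothetical."""
    return {
        "name": voice_name,
        "reference_audio_b64": base64.b64encode(reference_audio).decode("ascii"),
    }

# The actual request would target a Microsoft Foundry endpoint (URL hypothetical):
# requests.post("https://<foundry-endpoint>/voices", json=payload, headers=auth)
```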
The emphasis on “expressive” and “emotional range” matters because most TTS models sound flat over anything longer than a paragraph. Microsoft is claiming MAI-Voice-1 maintains natural prosody and nuance even in long-form output — the kind of claim you’d want to verify with your own ears, but the underlying architecture is clearly designed for more than basic narration.
Pricing starts at $22 per million characters. Both models are available through Microsoft Foundry and the MAI Playground.
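Both quoted prices are simple linear rates, which makes back-of-envelope budgeting trivial. A sketch using only the two numbers from the announcement:

```python
def transcription_cost(hours: float, rate_per_hour: float = 0.36) -> float:
    """Batch transcription cost at MAI-Transcribe-1's quoted $0.36/hour."""
    return hours * rate_per_hour

def tts_cost(characters: int, rate_per_million: float = 22.0) -> float:
    """Synthesis cost at MAI-Voice-1's quoted $22 per million characters."""
    return characters / 1_000_000 * rate_per_million
```

At those rates, 100 hours of meeting audio transcribes for about $36, and synthesizing a 50,000-character chapter costs about $1.10.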
The Strategic Play
Microsoft building its own speech models while still distributing OpenAI’s is the kind of hedging that makes partnership dynamics interesting. ElevenLabs just landed IBM’s agent voice business — the enterprise TTS market is fragmenting fast, and Microsoft clearly doesn’t want to depend on partners for a capability this fundamental.
The MAI family now covers image generation, speech transcription, and voice synthesis. The gap is a text model — and if Microsoft is willing to build foundation models for three modalities, the fourth feels inevitable.