Microsoft Launches VibeVoice, a New Speech-to-Text Model

Posted on Apr 28, 2026 by CurrentLens in Models

Available under the MIT license, VibeVoice enhances transcription capabilities for audio content.

AI Quick Take

VibeVoice incorporates speaker diarization for easier audio analysis.
The model accepts .wav and .mp3 files; in one test it transcribed an hour of audio in under nine minutes, with peak memory use of 61.5 GB.

Microsoft has launched VibeVoice, an AI-driven speech-to-text model inspired by Whisper that includes built-in speaker diarization. The model was made available on January 21, 2026, and is designed to process audio files efficiently and accurately across a range of transcription needs. Users can run VibeVoice on a Mac through its dedicated command-line interface, using audio files in the .wav and .mp3 formats.

When tested, the model transcribed an hour-long audio clip in approximately 8 minutes and 45 seconds, using up to 61.5 GB of RAM during processing. Users can adjust token limits to suit different audio durations, which accommodates longer recordings without loss of fidelity.
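To put the reported timing in perspective, the figures above can be converted into a real-time factor, meaning how many seconds of audio are processed per second of wall-clock time. The short Python sketch below is purely illustrative arithmetic on the numbers quoted in this article; the function name is our own and is not part of VibeVoice.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Figures reported in the article (the processing time is approximate).
audio_seconds = 60 * 60            # one hour of audio
processing_seconds = 8 * 60 + 45   # 8 min 45 s = 525 s

speedup = real_time_factor(audio_seconds, processing_seconds)
print(f"{speedup:.2f}x real time")  # prints "6.86x real time"
```

In other words, the test run worked through audio roughly seven times faster than playback speed, though results on other hardware or recordings may differ.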

This innovation positions Microsoft within the competitive landscape of speech recognition tools, appealing to industries relying on transcription for podcasts, meetings, and other spoken content. As VibeVoice is available under an MIT license, it encourages broader usage and integration across applications.

The launch of VibeVoice represents a notable advancement in Microsoft's suite of AI tools, especially in the domain of speech recognition and transcription technology. The integrated speaker diarization adds significant value, allowing users to distinguish between speakers seamlessly, which is crucial in environments such as interviews and collaborative discussions.

With this release, Microsoft may strengthen its appeal to professionals who rely on audio processing for documentation and analysis. As demand for better transcription tools grows, VibeVoice's adoption and user feedback will be key to understanding its impact on AI-driven audio technologies.
