The preview shifts audio generation from simple conversion toward controllable, tag‑driven speech in more than 70 languages.
AI Quick Take
- Preview adds natural‑language audio tags and native multi‑speaker dialogue for finer expressive control.
- Native support for 70+ languages signals a push toward broader multilingual TTS use cases.
Google has launched Gemini 3.1 Flash TTS, a preview text‑to‑speech model that emphasizes speech quality, expressive control, and multilingual generation. The preview is presented as a shift away from simple conversion workflows and toward more controllable, tag‑driven audio output.
Gemini 3.1 Flash TTS introduces natural‑language audio tags and native multi‑speaker dialogue, and Google says it supports more than 70 languages natively. Those features are intended to let developers specify expression and speaker turns in natural language rather than treating generation as a black box.
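To make the idea of tag‑driven, multi‑speaker input concrete, here is a minimal sketch of how a developer might assemble a dialogue script as plain text before handing it to a speech model. It is an illustration only: the `Turn` class, the `render_script` helper, and the "Speaker: [tag] text" layout are all hypothetical and are not taken from Google's documentation. The sketch makes no API call, since the article does not describe how the model is exposed.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn in a dialogue (hypothetical structure for this sketch)."""
    speaker: str   # label for who is speaking
    text: str      # the words to be spoken
    tag: str = ""  # optional natural-language direction, e.g. "cheerful"


def render_script(turns: list[Turn]) -> str:
    """Render speaker turns as a plain-text script, one line per turn.

    The "Speaker: [tag] text" layout is an assumption made for this sketch,
    not a documented Gemini 3.1 Flash TTS input format.
    """
    lines = []
    for turn in turns:
        # Emit the bracketed direction only when a tag was supplied.
        direction = f"[{turn.tag}] " if turn.tag else ""
        lines.append(f"{turn.speaker}: {direction}{turn.text}")
    return "\n".join(lines)


script = render_script([
    Turn("Host", "Welcome back to the show.", tag="cheerful"),
    Turn("Guest", "Thanks for having me."),
])
print(script)
```

Keeping the script as structured turns until the last step means the expressive direction and the speaker label stay separate from the spoken text, so the final layout can be changed in one place once the real input format is known.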
The change in focus matters for teams building localized, dialogic, or expressive voice experiences because it promises finer control without custom engineering around speaker switching or language handling. What to watch next: how Google exposes these capabilities in APIs or tools, performance and quality benchmarks beyond the preview, and whether third‑party adopters report practical gains in multilingual and multi‑speaker scenarios.