From 142a00112e4594f02da4b3bb2458f07b77883278 Mon Sep 17 00:00:00 2001
From: YaoyaoChang
Date: Tue, 27 Jan 2026 20:58:10 +0800
Subject: [PATCH] update ASR README: multilingual

---
 docs/vibevoice-asr.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/vibevoice-asr.md b/docs/vibevoice-asr.md
index e86fbe8..3ec7581 100644
--- a/docs/vibevoice-asr.md
+++ b/docs/vibevoice-asr.md
@@ -3,7 +3,7 @@
 [![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/microsoft/VibeVoice-ASR)
 [![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
 
-**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
 
 **Model:** [VibeVoice-ASR-7B](https://huggingface.co/microsoft/VibeVoice-ASR)
**Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
@@ -22,6 +22,9 @@
 
 - **📝 Rich Transcription (Who, When, What)**:
   The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
+
+- **🌍 Multilingual & Code-Switching Support**:
+  It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution)
 
 
 ## 🏗️ Model Architecture