diff --git a/Figures/language_distribution_horizontal.png b/Figures/language_distribution_horizontal.png
new file mode 100644
index 0000000..4c1eca7
Binary files /dev/null and b/Figures/language_distribution_horizontal.png differ
diff --git a/README.md b/README.md
index fe63f7c..9ec43f7 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,9 @@

📰 News

-2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in [Playground](https://aka.ms/vibevoice-asr).
+2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in [Playground](https://aka.ms/vibevoice-asr).
+- ⭐️ VibeVoice-ASR is natively multilingual — see the [supported languages](docs/vibevoice-asr.md#language-distribution) for details.
+- 🔥 The VibeVoice-ASR [finetuning code](finetuning-asr/README.md) is now available!
 
 2025-12-16: 📣 We added experimental speakers to VibeVoice‑Realtime‑0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.
diff --git a/docs/vibevoice-asr.md b/docs/vibevoice-asr.md
index 9d2bb4e..16300c9 100644
--- a/docs/vibevoice-asr.md
+++ b/docs/vibevoice-asr.md
@@ -79,6 +79,34 @@ python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --
 python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files [add a audio path here]
 ```
 
+### Results
+
+#### Multilingual
+| Dataset | Language | DER | cpWER | tcpWER | WER |
+|----------------|-----------|------|-------|--------|------|
+| MLC-Challenge | English | 4.28 | 11.48 | 13.02 | 7.99 |
+| MLC-Challenge | French | 3.80 | 18.80 | 19.64 | 15.21 |
+| MLC-Challenge | German | 1.04 | 17.10 | 17.26 | 16.30 |
+| MLC-Challenge | Italian | 2.08 | 15.76 | 15.91 | 13.91 |
+| MLC-Challenge | Japanese | 0.82 | 15.33 | 15.41 | 14.69 |
+| MLC-Challenge | Korean | 4.52 | 15.35 | 16.07 | 9.65 |
+| MLC-Challenge | Portuguese| 7.98 | 29.91 | 31.65 | 21.54 |
+| MLC-Challenge | Russian | 0.90 | 12.94 | 12.98 | 12.40 |
+| MLC-Challenge | Spanish | 2.67 | 10.51 | 11.71 | 8.04 |
+| MLC-Challenge | Thai | 4.09 | 14.91 | 15.57 | 13.61 |
+| MLC-Challenge | Vietnamese| 0.16 | 14.57 | 14.57 | 14.43 |
+
+---
+
+| Dataset | Language | DER | cpWER | tcpWER | WER |
+|----------------|-----------|------|-------|--------|------|
+| AISHELL-4 | Chinese | 6.77 | 24.99 | 25.35 | 21.40 |
+| AMI-IHM | English | 11.92| 20.41 | 20.82 | 18.81 |
+| AMI-SDM | English | 13.43| 28.82 | 29.80 | 24.65 |
+| AliMeeting | Chinese | 10.92| 29.33 | 29.51 | 27.40 |
+| MLC-Challenge | Average | 3.42 | 14.81 | 15.66 | 12.07|
+
+
 ## Finetuning
 
 LoRA (Low-Rank Adaptation) fine-tuning is supported. See [Finetuning](../finetuning-asr/README.md) for detailed guide.
@@ -86,3 +114,11 @@ LoRA (Low-Rank Adaptation) fine-tuning is supported. See [Finetuning](../finetun
 ## 📄 License
 
 This project is licensed under the [MIT License](../LICENSE).
+
+
+## Language Distribution
+

+ Language Distribution +

+ +