Language support

MLSDCherryPick
2026-01-24 05:10:47 +00:00
parent c0c2af984e
commit a3e99daedd
3 changed files with 39 additions and 1 deletion
Binary file not shown (image, 2.7 MiB).

+3 -1
@@ -21,7 +21,9 @@
<h3>📰 News</h3>
<strong>2026-01-21: 📣 We open-sourced <a href="docs/vibevoice-asr.md"><strong>VibeVoice-ASR</strong></a>, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in the [Playground](https://aka.ms/vibevoice-asr)</strong>.
- ⭐️ VibeVoice-ASR is natively multilingual — see the [supported languages](docs/vibevoice-asr.md#language-distribution) for details.
- 🔥 The VibeVoice-ASR [finetuning code](finetuning-asr/README.md) is now available!

2025-12-16: 📣 We added experimental speakers to <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice-Realtime-0.5B</strong></a> for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.
+36
@@ -79,6 +79,34 @@ python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --
python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files [add an audio path here]
```
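The inference script can also be driven from Python, for example to transcribe several recordings one after another. The sketch below is a hypothetical helper: `build_asr_command`, `transcribe_files`, and the file name `run_asr.py` are not part of the repository. It uses only the `--model_path` and `--audio_files` flags shown above and passes one file per invocation, because this section does not say whether `--audio_files` accepts several paths at once. It also assumes the working directory is the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Script path and flags come from the command above; the helper functions
# and the one-file-per-call loop are hypothetical.
SCRIPT = "demo/vibevoice_asr_inference_from_file.py"
DEFAULT_MODEL = "microsoft/VibeVoice-ASR"


def build_asr_command(audio_path: str, model_path: str = DEFAULT_MODEL) -> list[str]:
    """Return the argv list that transcribes a single audio file."""
    return [
        sys.executable,
        SCRIPT,
        "--model_path",
        model_path,
        "--audio_files",
        audio_path,
    ]


def transcribe_files(audio_paths: list[str], model_path: str = DEFAULT_MODEL) -> None:
    """Run the inference script once per file and stop at the first failure."""
    for path in audio_paths:
        if not Path(path).is_file():
            raise FileNotFoundError(path)
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_asr_command(path, model_path), check=True)


if __name__ == "__main__":
    # Hypothetical usage: python run_asr.py meeting1.wav meeting2.wav
    transcribe_files(sys.argv[1:])
```

Building the argument list separately from running it keeps the command easy to inspect or log before any model is loaded.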
### Results
#### Multilingual
| Dataset       | Language   | DER  | cpWER | tcpWER | WER   |
|---------------|------------|------|-------|--------|-------|
| MLC-Challenge | English    | 4.28 | 11.48 | 13.02  | 7.99  |
| MLC-Challenge | French     | 3.80 | 18.80 | 19.64  | 15.21 |
| MLC-Challenge | German     | 1.04 | 17.10 | 17.26  | 16.30 |
| MLC-Challenge | Italian    | 2.08 | 15.76 | 15.91  | 13.91 |
| MLC-Challenge | Japanese   | 0.82 | 15.33 | 15.41  | 14.69 |
| MLC-Challenge | Korean     | 4.52 | 15.35 | 16.07  | 9.65  |
| MLC-Challenge | Portuguese | 7.98 | 29.91 | 31.65  | 21.54 |
| MLC-Challenge | Russian    | 0.90 | 12.94 | 12.98  | 12.40 |
| MLC-Challenge | Spanish    | 2.67 | 10.51 | 11.71  | 8.04  |
| MLC-Challenge | Thai       | 4.09 | 14.91 | 15.57  | 13.61 |
| MLC-Challenge | Vietnamese | 0.16 | 14.57 | 14.57  | 14.43 |

---

| Dataset       | Language | DER   | cpWER | tcpWER | WER   |
|---------------|----------|-------|-------|--------|-------|
| AISHELL-4     | Chinese  | 6.77  | 24.99 | 25.35  | 21.40 |
| AMI-IHM       | English  | 11.92 | 20.41 | 20.82  | 18.81 |
| AMI-SDM       | English  | 13.43 | 28.82 | 29.80  | 24.65 |
| AliMeeting    | Chinese  | 10.92 | 29.33 | 29.51  | 27.40 |
| MLC-Challenge | Average  | 3.42  | 14.81 | 15.66  | 12.07 |
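
The tables report DER, cpWER, tcpWER, and WER. The sketch below implements the two word-level metrics under their standard definitions on whitespace-split text, to show what they measure. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. cpWER concatenates each speaker's words, scores every reference-to-hypothesis speaker mapping, and keeps the best one, so words attributed to the wrong speaker count as errors. This is a hypothetical illustration with hypothetical function names, not the scoring code behind the numbers above, which may normalize text differently. Languages written without spaces, such as Chinese or Japanese, would need character-level tokenization instead of `str.split()`. The sketch covers neither tcpWER, which also constrains matches by timestamps, nor DER.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,  # deletion of a reference word
                curr[j - 1] + 1,  # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Hypothetical WER helper: percent errors over whitespace-split words."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cpwer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    """Hypothetical cpWER helper: concatenate each speaker's text, then pick the
    reference-to-hypothesis speaker mapping with the fewest total errors."""
    refs = [text.split() for text in ref_by_spk.values()]
    hyps = [text.split() for text in hyp_by_spk.values()]
    # Pad with empty "speakers" so both sides have the same count; an
    # unmatched speaker then costs all of its words.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref_words = sum(len(r) for r in refs)
    # Brute force over all speaker permutations (fine for a few speakers).
    best = min(
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return 100.0 * best / total_ref_words


if __name__ == "__main__":
    ref = {"A": "hello there", "B": "good morning"}
    print(cpwer(ref, {"s1": "good morning", "s2": "hello there"}))  # 0.0: labels differ, mapping fixes it
    print(cpwer(ref, {"s1": "hello there good morning"}))  # 100.0: every word given to one speaker
```

The second call shows why cpWER can sit well above plain WER. Every word is recognized, but under the best speaker mapping each one lands on the wrong speaker. The exhaustive permutation search only scales to a handful of speakers; for larger sets, an assignment solver such as the Hungarian algorithm is the usual replacement.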
## Finetuning
LoRA (Low-Rank Adaptation) fine-tuning is supported. See [Finetuning](../finetuning-asr/README.md) for a detailed guide.
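
LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update instead. A wrapped layer computes y = Wx + (alpha / r) * B(Ax), where A is r x d_in and B is d_out x r. B starts at zero, so training begins from the unmodified model. The pure-Python sketch below illustrates the technique on a single linear layer. It is a generic illustration, not this repository's fine-tuning code (the linked guide covers that), and `LoRALinear`, `matvec`, and `lora_param_count` are hypothetical names.

```python
import random


def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical LoRA-wrapped linear layer: y = W x + (alpha / r) * B (A x).

    W stays frozen; only A (r x d_in) and B (d_out x r) would be trained.
    """

    def __init__(self, weight: list[list[float]], rank: int, alpha: float, seed: int = 0):
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.weight = weight  # frozen pretrained weight, d_out x d_in
        # A gets small random values. B starts at zero, so the initial
        # low-rank update B @ A is zero and the layer matches the base model.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one layer: r * d_in (A) + d_out * r (B)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
    print(layer.forward([1.0, 1.0]))  # [3.0, 7.0]: B is zero, so the output equals W x
    # A 1024 x 1024 layer at rank 8: 16384 trainable vs 1048576 frozen weights.
    print(lora_param_count(1024, 1024, 8), 1024 * 1024)
```

The last line shows the appeal of the approach: at rank 8, a 1024 x 1024 layer trains about 1.6% as many parameters as full fine-tuning of that layer would.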
@@ -86,3 +114,11 @@ LoRA (Low-Rank Adaptation) fine-tuning is supported. See [Finetuning](../finetun
## 📄 License
This project is licensed under the [MIT License](../LICENSE).
## Language Distribution
<p align="center">
<img src="../Figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
</p>