diff --git a/README.md b/README.md
index d01d56b..1c36f69 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@

📰 News

-2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context.
+2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. [Try it.](https://aka.ms/vibevoice-asr)
 2025-12-16: 📣 We added more experimental speakers for exploration, including multilingual voices and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.
diff --git a/docs/vibevoice-asr.md b/docs/vibevoice-asr.md
index e1318f2..6a59842 100644
--- a/docs/vibevoice-asr.md
+++ b/docs/vibevoice-asr.md
@@ -1,5 +1,7 @@
 # VibeVoice-ASR
+[![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/microsoft/VibeVoice-ASR)
+[![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
 **VibeVoice-ASR** is the latest addition to the **VibeVoice** family. While the original VibeVoice / VibeVoice-Realtime focused on expressive TTS, **VibeVoice-ASR** focuses on understanding long-form speech with high precision and rich metadata.
 It is a unified speech-to-text model designed to handle **1-hour long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **User-Customized Context**.
@@ -15,6 +17,8 @@ It is a unified speech-to-text model designed to handle **1-hour long-form audio
 - **📝 Rich Transcription (Who, When, What)**: The model performs ASR, Diarization, and Timestamping simultaneously. The output is a structured sequence indicating *who* said *what* at *which time*.
+[Try it here.](https://aka.ms/vibevoice-asr)
+
 ## 🏗️ Model Architecture