restructure README

2026-01-22 00:26:54 -08:00
parent ce90a49960
commit 32a7040ce0
5 changed files with 247 additions and 47 deletions
@@ -21,58 +21,85 @@

 <h3>📰 News</h3>

-<strong>2026-01-21: 📣 We open-sourced <a href="docs/vibevoice-asr.md"><strong>VibeVoice-ASR</strong></a>, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. [Try it.](https://aka.ms/vibevoice-asr)</strong>
+<strong>2026-01-21: 📣 We open-sourced <a href="docs/vibevoice-asr.md"><strong>VibeVoice-ASR</strong></a>, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in [Playground](https://aka.ms/vibevoice-asr)</strong>.

-https://github.com/user-attachments/assets/acde5602-dc17-4314-9e3b-c630bc84aefa
-
-<p align="center">
-  <img src="Figures/DER.jpg" alt="DER" height="200">
-  <img src="Figures/cpWER.jpg" alt="cpWER" height="200">
-  <img src="Figures/tcpWER.jpg" alt="tcpWER" height="200">
-</p>
-
-2025-12-16: 📣 We added more experimental speakers for exploration, including multilingual voices and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.
-
-2025-12-09: 📣 We added experimental speakers in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) for exploration—welcome to try them out and share your feedback.
+2025-12-16: 📣 We added experimental speakers to <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice‑Realtime‑0.5B</strong></a> for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. [Try it](docs/vibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.

 2025-12-03: 📣 We open-sourced <a href="docs/vibevoice-realtime-0.5b.md"><strong>VibeVoice‑Realtime‑0.5B</strong></a>, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on [Colab](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb).

-To mitigate deepfake risks and ensure low latency for the first speech chunk, voice prompts are provided in an embedded format. For users requiring voice customization, please reach out to our team. We will also be expanding the range of available speakers.
-<br>

-https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc
+2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.

-> (Launch your own realtime demo via the websocket example in [Usage](docs/vibevoice-realtime-0.5b.md#usage-1-launch-real-time-websocket-demo)).
+
+2025-08-25: 📣 We open-sourced <a href="docs/vibevoice-tts.md"><strong>VibeVoice-TTS</strong></a>, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers.

 </div>

-2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.
+## Overview
+
+VibeVoice is a **family of open-source frontier voice AI models** that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models. 
+
+A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of **7.5 Hz**. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
+
+For more information, demos, and examples, please visit our [Project Page](https://microsoft.github.io/VibeVoice).


-### Overview
+<div align="center">

-VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
+| Model |   Weight | Quick Try |
+|-------|--------------|---------|
+| VibeVoice-TTS-1.5B | [HF Link](https://huggingface.co/microsoft/VibeVoice-1.5B) | Disabled |
+| VibeVoice-Realtime-0.5B | [HF Link](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | [Colab](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb) |
+| VibeVoice-ASR-7B | [HF Link](https://huggingface.co/microsoft/VibeVoice-ASR) |  [Playground](https://aka.ms/vibevoice-asr) |

-VibeVoice currently includes two model variants:
+</div>

- **Long-form multi-speaker model**: Synthesizes conversational/single-speaker speech up to **90 minutes** with up to **4 distinct speakers**, surpassing the typical 1–2 speaker limits of many prior models.
- **[Realtime streaming TTS model](docs/vibevoice-realtime-0.5b.md)**: Produces initial audible speech in ~**300 ms** and supports **streaming text input** for single-speaker **real-time** speech generation; designed for low-latency generation.
-
-A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
+## Models


-<p align="left">
-  <img src="Figures/MOS-preference.png" alt="MOS Preference Results" height="260px">
-  <img src="Figures/VibeVoice.jpg" alt="VibeVoice Overview" height="250px" style="margin-right: 10px;">
-</p>
+### 1. 📖 [VibeVoice-ASR](docs/vibevoice-asr.md) - Long-form Speech Recognition
+
+**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.
+
+- **🕒 60-minute Single-Pass Processing**:
+  Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to **60 minutes** of continuous audio input within a 64K-token context. This ensures consistent speaker tracking and semantic coherence across the entire hour.
+
+- **👤 Customized Hotwords**:
+  Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
+
+- **📝 Rich Transcription (Who, When, What)**:
+  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
+
+[📖 Documentation](docs/vibevoice-asr.md) | [🤗 Hugging Face](https://huggingface.co/microsoft/VibeVoice-ASR) | [🎮 Playground](https://aka.ms/vibevoice-asr)


-### 🎵 Demo Examples
+<div align="center" id="vibevoice-asr">
+
+https://github.com/user-attachments/assets/acde5602-dc17-4314-9e3b-c630bc84aefa
+
+</div>


-**Video Demo**
+### 2. 🎙️ [VibeVoice-TTS](docs/vibevoice-tts.md) - Long-form Multi-speaker TTS
+
+**Best for**: Long-form conversational audio, podcasts, multi-speaker dialogues
+
+- **⏱️ 90-minute Long-form Generation**:
+  Synthesizes conversational/single-speaker speech up to **90 minutes** in a single pass, maintaining speaker consistency and semantic coherence throughout.
+
+- **👥 Multi-speaker Support**:
+  Supports up to **4 distinct speakers** in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
+
+- **🎭 Expressive Speech**:
+  Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
+
+- **🌐 Multi-lingual Support**:
+  Supports English, Chinese and other languages.
+
+
+[📖 Documentation](docs/vibevoice-tts.md) | [🤗 Hugging Face](https://huggingface.co/microsoft/VibeVoice-1.5B)  |  [📊 Paper](https://arxiv.org/pdf/2508.19205)
+

-We produced this video with [Wan2.2](https://github.com/Wan-Video/Wan2.2). We sincerely appreciate the Wan-Video team for their great work.

 **English**
 <div align="center">
@@ -111,20 +138,36 @@ https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727

 </div>

-For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).



-## Risks and limitations
+
+
+### 3. ⚡ [VibeVoice-Realtime](docs/vibevoice-realtime-0.5b.md) - Real-time Streaming TTS
+
+VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**.
+
+- Parameter size: 0.5B (deployment-friendly)
+- Real-time TTS (~300 milliseconds first audible latency)
+- Streaming text input
+- Robust long-form speech generation (~10 minutes)
+
+[📖 Documentation](docs/vibevoice-realtime-0.5b.md) | [🤗 Hugging Face](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) | [🚀 Colab](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb)
+
+
+<div align="center" id="generated-example-audio-vibevoice-realtime">
+
+https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc
+
+</div>
+
+
+## ⚠️ Risks and Limitations
+

 While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).
 Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

-English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.
-
-Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
-
-Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.

 We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.

@@ -5,6 +5,9 @@

 **VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.

+**Model:** [VibeVoice-ASR-7B](https://huggingface.co/microsoft/VibeVoice-ASR)<br>
+**Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
+
 ## 🔥 Key Features

 - **🕒 60-minute Single-Pass Processing**:
@@ -16,7 +19,6 @@
 - **📝 Rich Transcription (Who, When, What)**:
  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
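
To make the "who, when, what" structure concrete, here is a minimal Python sketch of how a structured transcript could be represented and rendered. The `Segment` class, its field names, and the output layout are hypothetical illustrations chosen for this example; they are not the model's actual output schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript entry. Hypothetical schema, not the model's real format."""
    speaker: str  # who
    start: float  # when: start time in seconds
    end: float    # when: end time in seconds
    text: str     # what


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: List[Segment]) -> str:
    """Render segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )
```

Keeping speaker, time span, and text in one record means downstream code can filter by speaker or time range without re-parsing free text.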

-**Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)

 ## 🏗️ Model Architecture

@@ -24,6 +26,14 @@
  <img src="../Figures/VibeVoice_ASR_archi.png" alt="VibeVoice ASR Architecture" width="80%">
 </p>

+## Demo
+
+<div align="center" id="vibevoice-asr">
+
+https://github.com/user-attachments/assets/acde5602-dc17-4314-9e3b-c630bc84aefa
+
+</div>
+
 ## Evaluation
 <p align="center">
  <img src="../Figures/DER.jpg" alt="DER" width="50%">
@@ -31,6 +41,8 @@
  <img src="../Figures/tcpWER.jpg" alt="tcpWER" width="50%">
 </p>

+
+
 ## Installation
 We recommend using the NVIDIA Deep Learning Container to manage the CUDA environment.

@@ -5,15 +5,12 @@
 [![Colab](https://img.shields.io/badge/Run-Colab-orange?logo=googlecolab)](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb)
 </div>

-VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build real-time TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~300 milliseconds** (hardware dependent).
+VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build real-time TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~200 milliseconds** (hardware dependent).

-<div align="center">

-| Model | Context Length | Generation Length |  Weight |
-|-------|----------------|----------|----------|
-| VibeVoice-Realtime-0.5B | 8K | ~10 min | [HF link](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B) |
+**Model:** [VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)<br>
+**Colab:** [Link](https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb)<br>

-</div>

 > Note (multilingual exploration): Although the model is primarily built for English, we found that it still exhibits a certain level of multilingual capability—and even performs reasonably well in some languages. We provide nine additional languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish) for users to explore. These multilingual behaviors have not been extensively tested; use with caution and share observations.

@@ -30,9 +27,10 @@ The model uses an interleaved, windowed design: it incrementally encodes incomin

 Key features:
 - Parameter size: 0.5B (deployment-friendly)
- Real-time TTS (~300 milliseconds first audible latency)
+- Real-time TTS (~200 milliseconds first audible latency)
 - Streaming text input
 - Robust long-form speech generation
+- 8K context window (~10 minutes of audio generation)

 This real-time variant supports only a single speaker. For multi‑speaker conversational speech generation, please use other VibeVoice models (long‑form multi‑speaker variants). The model is currently intended for English speech only; other languages may produce unpredictable results.

@@ -41,7 +39,7 @@ To mitigate deepfake risks and ensure low latency for the first speech chunk, vo

 ### 📋 TODO

- [ ] Add more voices (expand available speakers/voice timbres)
+- [x] Add more voices (expand available speakers/voice timbres)
 - [ ] Implement streaming text input function to feed new tokens while audio is still being generated
 - [ ] Merge models into the official Hugging Face `transformers` repository

@@ -0,0 +1,147 @@
+# VibeVoice-TTS
+
+[![Hugging Face](https://img.shields.io/badge/HuggingFace-Collection-orange?logo=huggingface)](https://huggingface.co/microsoft/VibeVoice-1.5B)
+[![Technical Report](https://img.shields.io/badge/Technical-Report-red?logo=arxiv)](https://arxiv.org/pdf/2508.19205)
+
+**VibeVoice-TTS** is a **long-form**, **multi-speaker** text-to-speech model designed to generate **expressive conversational audio** such as podcasts from text. It can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1–2 speaker limits of many prior models.
+
+
+**Model:** [VibeVoice-1.5B](https://huggingface.co/microsoft/VibeVoice-1.5B)<br>
+**Report:** [Technical Report](https://arxiv.org/pdf/2508.19205)<br>
+
+
+<div align="center">
+
+| Model | Context Length | Generation Length | Weight |
+|-------|----------------|-------------------|--------|
+| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
+| VibeVoice-Large | 32K | ~45 min | Disabled |
+
+</div>
+
+## 🔥 Key Features
+
+- **⏱️ 90-minute Long-form Generation**:
+  Synthesizes conversational/single-speaker speech up to **90 minutes** in a single pass, maintaining speaker consistency and semantic coherence throughout.
+
+- **👥 Multi-speaker Support**:
+  Supports up to **4 distinct speakers** in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
+
+- **🎭 Expressive Speech**:
+  Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
+
+- **🌐 Multi-lingual Support**:
+  Supports English, Chinese and other languages.
+
+## 🏗️ Model Architecture
+
+VibeVoice-TTS employs a [next-token diffusion](https://arxiv.org/pdf/2508.19205) framework that combines:
+
+- **Large Language Model (LLM)**: Based on Qwen2.5, understands textual context and dialogue flow
+- **Continuous Speech Tokenizers**: Acoustic and Semantic tokenizers operating at an ultra-low frame rate of **7.5 Hz**, efficiently preserving audio fidelity while boosting computational efficiency
+- **Diffusion Head**: Generates high-fidelity acoustic details through diffusion-based generation
+
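
The 7.5 Hz frame rate can be related to the context and generation lengths in the table above with a little arithmetic. The sketch below assumes that each tokenizer frame occupies one position in the context window and that "64K" and "32K" mean 65,536 and 32,768 positions; neither assumption is stated in this document, and the function names are illustrative only.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this document

# (context length, approximate generation length in minutes), from the table above.
# Assumption: "64K" = 64 * 1024 positions and "32K" = 32 * 1024 positions.
MODELS = {
    "VibeVoice-1.5B": (64 * 1024, 90),
    "VibeVoice-Large": (32 * 1024, 45),
}


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def context_share(model_name: str) -> float:
    """Fraction of the context window taken by speech frames.

    Assumption: one frame uses exactly one context position.
    """
    context_length, minutes = MODELS[model_name]
    return speech_frames(minutes) / context_length


if __name__ == "__main__":
    for name in MODELS:
        print(name, speech_frames(MODELS[name][1]), f"{context_share(name):.0%}")
```

Under these assumptions, 90 minutes of audio corresponds to 40,500 frames, roughly 62% of a 64K context, which would leave the remainder for the text script and voice prompts.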
+<div align="center">
+  <img src="../Figures/VibeVoice.jpg" alt="VibeVoice Overview" width="80%">
+</div>
+
+
+## 🎵 Demo Examples
+
+**English**
+<div align="center">
+
+https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784
+
+</div>
+
+**Chinese**
+<div align="center">
+
+https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f
+
+</div>
+
+**Cross-Lingual**
+<div align="center">
+
+https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722
+
+</div>
+
+**Spontaneous Singing**
+<div align="center">
+
+https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730
+
+</div>
+
+**Long Conversation with 4 people**
+<div align="center">
+
+https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727
+
+</div>
+
+For more examples, see the [Project Page](https://microsoft.github.io/VibeVoice).
+
+## Installation and Usage
+Disabled due to widespread misuse.
+
+## Results
+
+The model achieves state-of-the-art performance on long-form multi-speaker speech generation tasks. For detailed evaluation results, please refer to the [paper](https://arxiv.org/pdf/2508.19205).
+<div align="center">
+  <img src="../Figures/VibeVoice-TTS-results.jpg" alt="VibeVoice Results" width="80%">
+</div>
+
+
+
+## 🚨 Tips
+We have observed that users may encounter occasional instability when synthesizing Chinese speech. We recommend:
+
+- Using English punctuation even for Chinese text, preferably only commas and periods.
+- Using the Large model variant, which is considerably more stable.
+- If the generated voice speaks too fast, try splitting your text into multiple speaker turns that share the same speaker label.
+
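
The punctuation and chunking tips above can be sketched as a small preprocessing step. This is an illustrative example, not part of VibeVoice: the punctuation mapping, the `max_chars` threshold, and the `Speaker 1: ...` turn format are all assumptions made for this sketch, so adapt them to the script format you actually use.

```python
import re
from typing import List

# Map common full-width Chinese punctuation to ASCII commas and periods, and
# drop Chinese quotation marks. The exact mapping is an assumption of this sketch.
PUNCT_MAP = str.maketrans({
    "，": ",", "、": ",", "；": ",", "：": ",",
    "。": ".", "！": ".", "？": ".",
    "“": "", "”": "", "‘": "", "’": "",
})


def normalize_punctuation(text: str) -> str:
    """Replace Chinese punctuation with English commas and periods."""
    return text.translate(PUNCT_MAP)


def split_into_turns(text: str, speaker: str = "Speaker 1",
                     max_chars: int = 60) -> List[str]:
    """Split text into several shorter turns that share one speaker label.

    The "<speaker>: <text>" line format is a hypothetical script format.
    """
    # Each match is one sentence, including its closing period if present.
    sentences = re.findall(r"[^.]+\.?", normalize_punctuation(text))
    turns: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new turn once adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            turns.append(f"{speaker}: {current.strip()}")
            current = ""
        current += sentence
    if current.strip():
        turns.append(f"{speaker}: {current.strip()}")
    return turns
```

Splitting on sentence boundaries keeps each turn short without cutting a sentence in half; the speaker label stays constant so the voice does not change between chunks.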
+We'd like to thank [PsiPi](https://huggingface.co/PsiPi) for sharing an interesting approach to emotion control. Details can be found in [discussion12](https://huggingface.co/microsoft/VibeVoice-1.5B/discussions/12).
+
+
+## FAQ
+#### Q1: Is this a pretrained model?
+**A:** Yes, it's a pretrained model without any post-training or benchmark-specific optimizations. In a way, this makes VibeVoice very versatile and fun to use.
+
+#### Q2: Sounds / Music / BGM are triggered randomly.
+**A:** As you can see from our demo page, the background music or sounds are spontaneous. This means we can't directly control whether they are generated or not. The model is content-aware, and these sounds are triggered based on the input text and the chosen voice prompt.
+
+Here are a few things we've noticed:
+*   If the voice prompt you use contains background music, the generated speech is more likely to have it as well. (The Large model is quite stable and effective at this—give it a try on the demo!)
+*   If the voice prompt is clean (no BGM), but the input text includes introductory words or phrases like "Welcome to," "Hello," or "However," background music might still appear.
+*   The effect is also speaker-dependent: using the "Alice" voice resulted in random BGM more often than other voices (fixed).
+*   In other scenarios, the Large model is more stable and has a lower probability of generating unexpected background music.
+
+In fact, we intentionally decided not to denoise our training data because we think it's an interesting feature for BGM to show up at just the right moment. You can think of it as a little easter egg we left for you.
+
+#### Q3: Text normalization?
+**A:** We don't perform any text normalization during training or inference. Our philosophy is that a large language model should be able to handle complex user inputs on its own. However, due to the nature of the training data, you might still run into some corner cases.
+
+#### Q4: Singing Capability.
+**A:** Our training data **doesn't contain any music data**. The ability to sing is an emergent capability of the model (which is why it might sound off-key, even on a famous song like 'See You Again'). The Large model is more likely to exhibit this than the 1.5B model.
+
+#### Q5: Some Chinese pronunciation errors.
+**A:** The volume of Chinese data in our training set is significantly smaller than the English data. Additionally, certain special characters (e.g., Chinese quotation marks) may occasionally cause pronunciation issues.
+
+#### Q6: Instability of cross-lingual transfer.
+**A:** The model does exhibit strong cross-lingual transfer capabilities, including the preservation of accents, but its performance can be unstable. This is an emergent ability of the model that we have not specifically optimized. It's possible that a satisfactory result can be achieved through repeated sampling.
+
+## Risks and Limitations
+
+While efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).
+
+Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
+
+English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.
+
+Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
+
+Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
+
+We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.