update report link

YaoyaoChang
2025-08-25 11:59:06 -07:00
parent 8ca79df4b1
commit 2dd0555ebc
+2 -2
@@ -112,7 +112,7 @@
<a href="https://aka.ms/GeneralAI" target="_blank">MSRA GeneralAI Group</a>
</p> -->
<p class="links" style="text-align:center; margin:0 0 14px;">
-<a href="https://github.com/microsoft/VibeVoice/report/TechnicalReport.pdf" target="_blank">📄 Report</a>
+<a href="https://github.com/microsoft/VibeVoice/blob/main/report/TechnicalReport.pdf" target="_blank">📄 Report</a>
<span class="sep">·</span>
<a href="https://github.com/microsoft/VibeVoice" target="_blank"><svg width="16" height="16" fill="currentColor" viewBox="0 0 16 16" style="vertical-align: text-bottom;"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.012 8.012 0 0 0 16 8c0-4.42-3.58-8-8-8z"/></svg> Code</a>
<span class="sep">·</span>
@@ -124,7 +124,7 @@
</p>
<p class="muted" style="margin:0;">
-VibeVoice is a novel framework designed for generating <b>expressive, long-form, multi-speaker conversational audio</b>, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
+VibeVoice is a novel framework designed for generating <b>expressive, long-form, multi-speaker </b>conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
</p>