VibeVoice

Author	SHA1	Message	Date
Jianwei Yu	5cd81bb497	fix: restore sequential encoder (batch encoder causes OOM). Batch encoder across multiple requests caused GPU OOM when vLLM scheduler sends many audio items at once. The encoder intermediates (~700MB per 69s audio) compete with KV cache for GPU memory. Sequential encoding is stable and proven correct. The encoder (267ms per request) is not the primary throughput bottleneck when encoder cache is enabled (default). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-27 18:48:06 +00:00
Jianwei Yu	cd945395d4	feat: set nginx workers to 2×dp for optimal HTTP throughput. Nginx worker_processes now defaults to 2×N (where N is the number of DP replicas) instead of 'auto'. This ensures enough HTTP handler processes to fully saturate all GPU backends under heavy concurrent load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-27 09:16:05 +00:00
Jianwei Yu	e6b65abb9b	fix: auto-tune per-worker env vars in DP mode. Pass VIBEVOICE_FFMPEG_MAX_CONCURRENCY and VLLM_MEDIA_LOADING_THREAD_COUNT to each worker subprocess so they inherit the correct settings regardless of how the container is launched (--skip-deps or not). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-27 07:57:49 +00:00
Jianwei Yu	3817f74d46	feat: nginx-based data parallel for optimal ASR throughput. When --dp N is specified (N > 1), the launcher now starts N independent vLLM processes behind an nginx reverse proxy instead of using vLLM's built-in DP coordinator. This avoids the single-process HTTP bottleneck when handling large base64 audio payloads, achieving near-linear scaling (7.2x with 8 GPUs at 4096 concurrent requests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-27 07:43:32 +00:00
JianweiYu	9634518ca4	Add data parallel (DP) support to vLLM server launcher: add --dp/--data-parallel-size flag for running independent model replicas across multiple GPUs with automatic load balancing behind a single port; add --tp/--tensor-parallel-size flag (previously hardcoded to 1); update docs/vibevoice-vllm-asr.md with multi-GPU deployment guide covering DP, TP, and hybrid (DP × TP) configurations. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-24 11:53:31 +00:00
JianweiYu	09ca114fa3	Add Gradio ASR demo with video support and demo audio/video files: add gradio_asr_demo_api_video.py (Gradio web UI supporting audio/video upload, streaming output, hotwords, and Cloudflare tunnel); add demo/asr_demo/ (demo audio and video files for the Gradio interface). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-22 06:11:51 +00:00
Damon-Salvetore	165e17e5ed	fix: vllm-version-stable	2026-02-25 07:30:43 +00:00
YingboHAO	a4add8e52f	fix backend	2026-02-08 09:58:19 +00:00
YingboHAO	0508c3e86f	fix	2026-02-06 14:38:16 +00:00
YingboHAO	7761242bf3	fix	2026-02-06 05:52:48 +00:00
YingboHAO	bb54f78d0e	feat: add hotwords support for vLLM ASR	2026-02-04 10:33:20 +00:00
YingboHAO	0055161273	Add test_api_auto_recover.py and test audio files	2026-02-02 13:49:01 +00:00
YingboHAO	1eb04f53a2	Replace install_deps.sh with start_server.py one-click deployment	2026-01-26 07:34:54 +00:00
YingboHAO	04f8bc40b0	Update test_api.py	2026-01-23 17:47:31 +00:00
YingboHAO	4df5b0582f	Add vLLM plugin support for high-performance ASR serving	2026-01-23 17:32:24 +00:00

15 Commits
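
The nginx-based data-parallel layout described in commits 3817f74d46 and cd945395d4 (N independent vLLM processes behind one reverse proxy, with worker_processes set to 2×N) could be sketched roughly as below. This is only an illustration and not the repository's launcher code: the function name render_nginx_conf, the upstream name vllm_backends, the port numbers, and the worker_connections value are all assumptions made for the example.

```python
def render_nginx_conf(dp: int, listen_port: int = 8000, base_backend_port: int = 8001) -> str:
    """Render a minimal nginx config that fronts `dp` backend replicas.

    Hypothetical helper: the ports and upstream name are illustrative
    assumptions, not values taken from the repository.
    """
    if dp < 2:
        # The commit log describes the nginx path only for --dp N with N > 1.
        raise ValueError("nginx-based data parallel expects dp > 1")

    # Commit cd945395d4: worker_processes defaults to 2 x N instead of 'auto'.
    workers = 2 * dp

    # One upstream entry per replica; each replica is assumed to listen on
    # its own consecutive local port.
    servers = "\n".join(
        f"        server 127.0.0.1:{base_backend_port + i};" for i in range(dp)
    )

    return f"""worker_processes {workers};
events {{
    worker_connections 1024;
}}
http {{
    upstream vllm_backends {{
{servers}
    }}
    server {{
        listen {listen_port};
        location / {{
            proxy_pass http://vllm_backends;
        }}
    }}
}}
"""


if __name__ == "__main__":
    # Example: 8 replicas, matching the 8-GPU setup mentioned in the log.
    print(render_nginx_conf(8))
```

With no balancing directive in the upstream block, nginx distributes requests round-robin across the listed servers, which fits the "independent replicas behind a single port" description; a real deployment would likely also need request-size and timeout settings suited to large audio payloads.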