15 Commits

Author SHA1 Message Date
Jianwei Yu 5cd81bb497 fix: restore sequential encoder (batch encoder causes OOM)
Batching the encoder across multiple requests caused GPU OOM when the
vLLM scheduler sent many audio items at once. The encoder intermediates
(~700MB per 69s of audio) compete with the KV cache for GPU memory.

Sequential encoding is stable and proven correct. The encoder
(267ms per request) is not the primary throughput bottleneck when
the encoder cache is enabled (the default).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-27 18:48:06 +00:00
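
The trade-off described in commit 5cd81bb497 can be sketched as follows. This is a minimal illustration, not the repository's code: `encode_sequentially`, `encode_one`, and `peak_intermediate_mb` are hypothetical names, and the memory estimate assumes that encoder intermediates scale linearly with batch size at the ~700MB-per-item figure quoted in the commit message.

```python
from typing import Callable, Iterable, List, TypeVar

AudioT = TypeVar("AudioT")
FeatT = TypeVar("FeatT")


def encode_sequentially(
    items: Iterable[AudioT], encode_one: Callable[[AudioT], FeatT]
) -> List[FeatT]:
    """Encode one item at a time (hypothetical helper).

    Only one item's encoder intermediates are live at any moment, so the
    peak extra memory does not grow with the number of queued items.
    """
    outputs: List[FeatT] = []
    for item in items:
        outputs.append(encode_one(item))
    return outputs


def peak_intermediate_mb(
    num_items: int, per_item_mb: float = 700.0, batched: bool = False
) -> float:
    """Rough peak size of encoder intermediates in MB.

    Assumption: a batched pass holds intermediates for every item at once,
    while a sequential pass holds them for a single item.
    """
    if num_items <= 0:
        return 0.0
    return per_item_mb * num_items if batched else per_item_mb
```

Under these assumptions, eight 69s clips in one batch would need roughly 5600MB of intermediates, against roughly 700MB when encoded one by one.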
Jianwei Yu cd945395d4 feat: set nginx workers to 2×dp for optimal HTTP throughput
Nginx worker_processes now defaults to 2×N (where N is the number of DP
replicas) instead of 'auto'. This ensures enough HTTP handler processes
to fully saturate all GPU backends under heavy concurrent load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-27 09:16:05 +00:00
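
The 2×N rule from commit cd945395d4 is simple enough to show directly. The sketch below is illustrative only; the function names are hypothetical, and only the nginx `worker_processes` directive and the 2×N default come from the commit message.

```python
def nginx_worker_count(dp_replicas: int) -> int:
    """Return the nginx worker count for N DP replicas: 2 x N."""
    if dp_replicas < 1:
        raise ValueError("dp_replicas must be at least 1")
    return 2 * dp_replicas


def worker_processes_directive(dp_replicas: int) -> str:
    """Render the nginx directive line for the given replica count."""
    return f"worker_processes {nginx_worker_count(dp_replicas)};"
```

For example, `worker_processes_directive(8)` yields `worker_processes 16;` for an 8-replica deployment.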
Jianwei Yu e6b65abb9b fix: auto-tune per-worker env vars in DP mode
Pass VIBEVOICE_FFMPEG_MAX_CONCURRENCY and VLLM_MEDIA_LOADING_THREAD_COUNT
to each worker subprocess so they inherit the correct settings regardless
of how the container is launched (with or without --skip-deps).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-27 07:57:49 +00:00
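
One way to pass per-worker settings, as commit e6b65abb9b describes, is to build an explicit environment for each subprocess. This is a sketch under assumptions: the two variable names come from the commit message, but `build_worker_env`, its parameters, and how the values are tuned are hypothetical and not taken from the launcher.

```python
import os
from typing import Dict, Mapping


def build_worker_env(
    base_env: Mapping[str, str],
    ffmpeg_max_concurrency: int,
    media_loading_threads: int,
) -> Dict[str, str]:
    """Return a copy of base_env with the per-worker settings applied.

    The caller's mapping is not modified, so each worker can receive
    its own values.
    """
    env = dict(base_env)
    env["VIBEVOICE_FFMPEG_MAX_CONCURRENCY"] = str(ffmpeg_max_concurrency)
    env["VLLM_MEDIA_LOADING_THREAD_COUNT"] = str(media_loading_threads)
    return env


# Example: derive a worker environment from the current process environment.
# The numeric values here are placeholders, not tuned recommendations.
example_env = build_worker_env(os.environ, ffmpeg_max_concurrency=4,
                               media_loading_threads=8)
```

The resulting dictionary can be handed to `subprocess.Popen(..., env=example_env)`, so the worker sees the settings no matter how the parent process was started.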
Jianwei Yu 3817f74d46 feat: nginx-based data parallel for optimal ASR throughput
When --dp N is specified (N > 1), the launcher now starts N independent
vLLM processes behind an nginx reverse proxy instead of using vLLM's
built-in DP coordinator. This avoids the single-process HTTP bottleneck
when handling large base64 audio payloads, achieving near-linear scaling
(7.2x with 8 GPUs at 4096 concurrent requests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-27 07:43:32 +00:00
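
The layout described in commit 3817f74d46 (N independent backends behind one nginx reverse proxy) might look like the config generated below. This is a hedged sketch, not the launcher's actual output: the upstream name `asr_backends`, the ports, the body-size limit, and the use of nginx's default round-robin balancing are all assumptions made for illustration.

```python
from typing import List


def render_nginx_conf(
    backend_ports: List[int], listen_port: int = 8000, max_body: str = "512m"
) -> str:
    """Render a minimal nginx config that proxies to local backends.

    Each port is assumed to be one independent vLLM process. With no
    balancing directive, nginx distributes requests round-robin.
    """
    if not backend_ports:
        raise ValueError("at least one backend port is required")
    servers = "\n".join(
        f"        server 127.0.0.1:{port};" for port in backend_ports
    )
    return (
        "events {}\n"
        "http {\n"
        "    upstream asr_backends {\n"
        f"{servers}\n"
        "    }\n"
        "    server {\n"
        f"        listen {listen_port};\n"
        # Large base64 audio payloads need a generous request-body limit.
        f"        client_max_body_size {max_body};\n"
        "        location / {\n"
        "            proxy_pass http://asr_backends;\n"
        "        }\n"
        "    }\n"
        "}\n"
    )
```

Calling `render_nginx_conf([8001, 8002])` produces an upstream block with two `server` entries and a single public listener on port 8000.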
JianweiYu 9634518ca4 Add data parallel (DP) support to vLLM server launcher
- Add --dp/--data-parallel-size flag for running independent model replicas
  across multiple GPUs with automatic load balancing behind a single port
- Add --tp/--tensor-parallel-size flag (previously hardcoded to 1)
- Update docs/vibevoice-vllm-asr.md with multi-GPU deployment guide
  covering DP, TP, and hybrid (DP × TP) configurations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-24 11:53:31 +00:00
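
The flags added in commit 9634518ca4 could be declared with `argparse` roughly as follows. The flag names come from the commit message; the defaults, the `dest` names, and the `gpus_required` helper are assumptions for illustration (the default of 1 for `--tp` mirrors the previously hardcoded value, while the default for `--dp` is a guess).

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the DP and TP flags."""
    parser = argparse.ArgumentParser(description="launcher flags (sketch)")
    parser.add_argument("--dp", "--data-parallel-size", dest="dp",
                        type=int, default=1,
                        help="number of independent model replicas")
    parser.add_argument("--tp", "--tensor-parallel-size", dest="tp",
                        type=int, default=1,
                        help="number of GPUs each replica is sharded across")
    return parser


def gpus_required(dp: int, tp: int) -> int:
    """A hybrid DP x TP deployment needs dp * tp GPUs in total."""
    return dp * tp


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the given argument list (or sys.argv when None)."""
    return build_parser().parse_args(argv)
```

For instance, `parse(["--dp", "4", "--tp", "2"])` describes four replicas of two GPUs each, which `gpus_required(4, 2)` counts as eight GPUs.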
JianweiYu 09ca114fa3 Add Gradio ASR demo with video support and demo audio/video files
- Add gradio_asr_demo_api_video.py: Gradio web UI supporting audio/video upload,
  streaming output, hotwords, and Cloudflare tunnel
- Add demo/asr_demo/: demo audio and video files for the Gradio interface

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-22 06:11:51 +00:00
Damon-Salvetore 165e17e5ed fix: vllm-version-stable 2026-02-25 07:30:43 +00:00
YingboHAO a4add8e52f fix backend 2026-02-08 09:58:19 +00:00
YingboHAO 0508c3e86f fix 2026-02-06 14:38:16 +00:00
YingboHAO 7761242bf3 fix 2026-02-06 05:52:48 +00:00
YingboHAO bb54f78d0e feat: add hotwords support for vLLM ASR 2026-02-04 10:33:20 +00:00
YingboHAO 0055161273 Add test_api_auto_recover.py and test audio files 2026-02-02 13:49:01 +00:00
YingboHAO 1eb04f53a2 Replace install_deps.sh with start_server.py one-click deployment 2026-01-26 07:34:54 +00:00
YingboHAO 04f8bc40b0 Update test_api.py 2026-01-23 17:47:31 +00:00
YingboHAO 4df5b0582f Add vLLM plugin support for high-performance ASR serving 2026-01-23 17:32:24 +00:00