Nginx worker_processes now defaults to 2×N (where N is the number of
data-parallel (DP) replicas) instead of 'auto'. This ensures enough HTTP
handler processes to saturate all GPU backends under heavy concurrent load.
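
A minimal sketch of the 2×N rule described above. The function names are
hypothetical and are not taken from the launcher; only the multiplier of 2
and the worker_processes directive come from this change.

```python
def nginx_worker_processes(dp_replicas: int, multiplier: int = 2) -> int:
    """Return the worker_processes value for a given number of DP replicas."""
    if dp_replicas < 1:
        raise ValueError("dp_replicas must be >= 1")
    return multiplier * dp_replicas


def render_worker_directive(dp_replicas: int) -> str:
    """Render the nginx directive line (hypothetical helper)."""
    return f"worker_processes {nginx_worker_processes(dp_replicas)};"
```

For example, render_worker_directive(8) yields "worker_processes 16;".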
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass VIBEVOICE_FFMPEG_MAX_CONCURRENCY and VLLM_MEDIA_LOADING_THREAD_COUNT
to each worker subprocess so that every worker inherits the correct settings
regardless of how the container is launched (with or without --skip-deps).
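
One way to forward these variables explicitly is to build the environment
for each worker before spawning it. This is a sketch under assumptions:
build_worker_env and its arguments are hypothetical names, and the actual
launcher may set the variables differently.

```python
import os

# The two variable names come from this change; everything else is illustrative.
FORWARDED_VARS = (
    "VIBEVOICE_FFMPEG_MAX_CONCURRENCY",
    "VLLM_MEDIA_LOADING_THREAD_COUNT",
)


def build_worker_env(settings, base_env=None):
    """Copy the base environment and set the forwarded variables explicitly."""
    env = dict(os.environ if base_env is None else base_env)
    for name in FORWARDED_VARS:
        if name in settings:
            # Environment values must be strings.
            env[name] = str(settings[name])
    return env
```

The resulting mapping can be passed as the env argument of subprocess.Popen
so the worker does not depend on what the parent shell happened to export.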
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --dp N is specified (N > 1), the launcher now starts N independent
vLLM processes behind an nginx reverse proxy instead of using vLLM's
built-in DP coordinator. This avoids the single-process HTTP bottleneck
when handling large base64 audio payloads, achieving near-linear scaling
(7.2x with 8 GPUs at 4096 concurrent requests).
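
A sketch of the kind of nginx configuration this layout implies: one
upstream entry per replica and a single listening port in front. The port
numbers, the upstream name, worker_connections, and the unlimited
client_max_body_size are assumptions for illustration, not values taken
from the launcher. With no balancing directive, nginx uses round-robin.

```python
def render_nginx_conf(dp_replicas, listen_port=8000, base_port=8001):
    """Render a reverse-proxy config for dp_replicas backends (illustrative)."""
    if dp_replicas < 2:
        raise ValueError("the proxy layout only applies when dp_replicas > 1")
    # One backend per replica, on consecutive local ports (assumed scheme).
    servers = "\n".join(
        f"        server 127.0.0.1:{base_port + i};" for i in range(dp_replicas)
    )
    return (
        f"worker_processes {2 * dp_replicas};\n"
        "events { worker_connections 4096; }\n"
        "http {\n"
        "    upstream vllm_replicas {\n"
        f"{servers}\n"
        "    }\n"
        "    server {\n"
        f"        listen {listen_port};\n"
        "        # 0 disables the request body size check (large base64 audio).\n"
        "        client_max_body_size 0;\n"
        "        location / {\n"
        "            proxy_pass http://vllm_replicas;\n"
        "        }\n"
        "    }\n"
        "}\n"
    )
```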
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add --dp/--data-parallel-size flag for running independent model replicas
  across multiple GPUs with automatic load balancing behind a single port
- Add --tp/--tensor-parallel-size flag (previously hardcoded to 1)
- Update docs/vibevoice-vllm-asr.md with multi-GPU deployment guide
  covering DP, TP, and hybrid (DP × TP) configurations
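
A sketch of how the two flags could be parsed and how a hybrid DP × TP
setup might map onto GPUs. Only the flag names come from this change; the
default of 1 for --dp, the dest names, and the contiguous GPU grouping are
assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the parallelism flags (illustrative, not the launcher's parser)."""
    parser = argparse.ArgumentParser(description="parallelism flags sketch")
    parser.add_argument("--dp", "--data-parallel-size", dest="dp", type=int, default=1)
    parser.add_argument("--tp", "--tensor-parallel-size", dest="tp", type=int, default=1)
    return parser.parse_args(argv)


def gpu_groups(dp, tp):
    """Assign tp contiguous GPU indices to each of the dp replicas."""
    return [list(range(r * tp, (r + 1) * tp)) for r in range(dp)]
```

Under these assumptions, --dp 2 --tp 4 uses 8 GPUs: replica 0 gets GPUs 0-3
and replica 1 gets GPUs 4-7.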
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add gradio_asr_demo_api_video.py: Gradio web UI supporting audio/video
  upload, streaming output, hotwords, and Cloudflare tunnel
- Add demo/asr_demo/: demo audio and video files for the Gradio interface
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>