Batch encoder across multiple requests caused GPU OOM when vLLM
scheduler sends many audio items at once. The encoder intermediates
(~700MB per 69s audio) compete with KV cache for GPU memory.
Sequential encoding is stable and proven correct. The encoder
(267ms per request) is not the primary throughput bottleneck when
encoder cache is enabled (default).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Nginx worker_processes now defaults to 2×N (where N is the number of DP
replicas) instead of 'auto'. This ensures enough HTTP handler processes
to fully saturate all GPU backends under heavy concurrent load.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass VIBEVOICE_FFMPEG_MAX_CONCURRENCY and VLLM_MEDIA_LOADING_THREAD_COUNT
to each worker subprocess so they inherit the correct settings regardless
of how the container is launched (--skip-deps or not).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --dp N is specified (N > 1), the launcher now starts N independent
vLLM processes behind an nginx reverse proxy instead of using vLLM's
built-in DP coordinator. This avoids the single-process HTTP bottleneck
when handling large base64 audio payloads, achieving near-linear scaling
(7.2x with 8 GPUs at 4096 concurrent requests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add --dp/--data-parallel-size flag for running independent model replicas
across multiple GPUs with automatic load balancing behind a single port
- Add --tp/--tensor-parallel-size flag (previously hardcoded to 1)
- Update docs/vibevoice-vllm-asr.md with multi-GPU deployment guide
covering DP, TP, and hybrid (DP × TP) configurations
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add gradio_asr_demo_api_video.py: Gradio web UI supporting audio/video upload,
streaming output, hotwords, and Cloudflare tunnel
- Add demo/asr_demo/: demo audio and video files for the Gradio interface
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Added announcement that VibeVoice ASR is now part of Transformers v5.3.0 release
- Linked to the official Hugging Face Transformers release page
- Positioned as the latest news item with today's date
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add MPS device choice and auto-detect MPS availability
- Change default attention implementation to 'auto' with smart fallback
- Auto-detect flash_attention_2 availability on CUDA, fallback to sdpa
- Use sdpa for MPS and CPU devices (flash_attention_2 not supported)
- Use float32 dtype for MPS/CPU devices for better compatibility
Fixes#206
Fixes#199 - Object of type dtype is not JSON serializable
When loading models with torch_dtype as a torch.dtype object (e.g.,
torch.bfloat16), transformers would fail to serialize the config to
JSON for logging purposes, raising TypeError.
This fix:
- Adds _convert_dtype_to_string() helper function to convert torch.dtype
objects to their string representation (e.g., 'bfloat16')
- Overrides to_dict() method in VibeVoiceConfig, VibeVoiceASRConfig,
and VibeVoiceStreamingConfig to apply this conversion
The fix is backward compatible - string dtype values and None values
continue to work as expected.