Files
VibeVoice/docs/vibevoice-vllm-asr.md
T
Jianwei Yu cd945395d4 feat: set nginx workers to 2×dp for optimal HTTP throughput
Nginx worker_processes now defaults to 2×N (where N is the number of DP
replicas) instead of 'auto'. This ensures enough HTTP handler processes
to fully saturate all GPU backends under heavy concurrent load.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-27 09:16:05 +00:00

190 lines
6.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# VibeVoice vLLM ASR Deployment
<a href="https://huggingface.co/microsoft/VibeVoice-ASR"><img alt="Huggingface" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-VibeVoice--ASR-blue"></a>
Deploy VibeVoice ASR model as a high-performance API service using [vLLM](https://github.com/vllm-project/vllm). This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
## 🔥 Key Features
- **🚀 High-Performance Serving**: Optimized for high-throughput ASR inference with vLLM's continuous batching
- **📡 OpenAI-Compatible API**: Standard `/v1/chat/completions` endpoint with streaming support
- **🎵 Long Audio Support**: Process up to 60+ minutes of audio in a single request
- **🔌 Plugin Architecture**: No vLLM source code modification required - just install and run
- **⚡ Data Parallel (DP)**: Run independent model replicas across multiple GPUs with automatic load balancing behind a single port
## 🛠️ Installation
Using Official vLLM Docker Image (Recommended)
1. Clone the repository
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
```
2. Launch the server (background mode)
```bash
docker run -d --gpus all --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py"
```
## ⚡ Multi-GPU Deployment
The launcher supports two types of GPU parallelism via `--tp` and `--dp` flags:
| Flag | Name | What it does |
|------|------|-------------|
| `--tp N` | Tensor Parallel | Splits **one model** across N GPUs (for models too large for a single GPU) |
| `--dp N` | Data Parallel | Runs **N independent replicas**, one per GPU, with automatic load balancing behind a single port |
### Data Parallel (Recommended for scaling throughput)
Run N independent replicas on N GPUs with automatic load balancing behind a single port.
When `--dp N` is specified (N > 1), the launcher automatically starts N independent vLLM
processes behind an nginx reverse proxy (2×N workers) for optimal throughput:
```bash
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 4"
```
Run on all 8 GPUs:
```bash
docker run -d --gpus all --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 8"
```
### Tensor Parallel
Split a single model across 2 GPUs (useful if GPU memory is limited):
```bash
docker run -d --gpus '"device=0,1"' --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py --tp 2"
```
### Hybrid (DP × TP)
Combine both — e.g., 2 replicas, each split across 2 GPUs (4 GPUs total):
```bash
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 2 --tp 2"
```
> **Note**: Total GPUs required = `dp × tp`. Make sure to expose enough GPU devices in the Docker `--gpus` flag.
3. View logs
```bash
docker logs -f vibevoice-vllm
```
> **Note**:
> - The `-d` flag runs the container in background (detached mode)
> - Use `docker stop vibevoice-vllm` to stop the service
> - The model will be downloaded to HuggingFace cache (`~/.cache/huggingface`) inside the container
## 🚀 Usages
### Test the API
Once the vLLM server is running, test it with the provided script:
```bash
# Basic transcription
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav
# With hotwords for better recognition of specific terms
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
```
```bash
# With auto-recovery from repetition loops (for long audio)
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav
# Auto-recover with hotwords
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
```
> **Note**:
> - The audio/video file must be inside the mounted directory (`/app` in the container). Copy your files to the VibeVoice folder before testing.
> - Hotwords help improve recognition of domain-specific terms like proper nouns, technical terms, and speaker names.
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum FFmpeg processes for audio decoding | `64` |
| `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` |
## 📊 Performance Tips
1. **GPU Memory**: Use `--gpu-memory-utilization 0.9` for maximum throughput if you have dedicated GPU
2. **Batch Size**: Increase `--max-num-seqs` for higher concurrency (requires more GPU memory)
3. **FFmpeg Concurrency**: Tune `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` based on CPU cores
## 🚨 Troubleshooting
### Common Issues
1. **"CUDA out of memory"**
- Reduce `--gpu-memory-utilization`
- Reduce `--max-num-seqs`
- Use smaller `--max-model-len`
2. **"Audio decoding failed"**
- Ensure FFmpeg is installed: `ffmpeg -version`
- Check audio file format is supported
3. **"Model not found"**
- Ensure model path contains `config.json` and model weights
- Generate tokenizer files if missing
4. **"Plugin not loaded"**
- Verify installation: `pip show vibevoice`
- Check entry point: `pip show -f vibevoice | grep entry`