cd945395d4
Nginx worker_processes now defaults to 2×N (where N is the number of DP replicas) instead of 'auto'. This ensures enough HTTP handler processes to fully saturate all GPU backends under heavy concurrent load. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
190 lines
6.2 KiB
Markdown
190 lines
6.2 KiB
Markdown
# VibeVoice vLLM ASR Deployment
|
||
|
||
<a href="https://huggingface.co/microsoft/VibeVoice-ASR"><img alt="Huggingface" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-VibeVoice--ASR-blue"></a>
|
||
|
||
Deploy VibeVoice ASR model as a high-performance API service using [vLLM](https://github.com/vllm-project/vllm). This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
|
||
|
||
## 🔥 Key Features
|
||
|
||
- **🚀 High-Performance Serving**: Optimized for high-throughput ASR inference with vLLM's continuous batching
|
||
- **📡 OpenAI-Compatible API**: Standard `/v1/chat/completions` endpoint with streaming support
|
||
- **🎵 Long Audio Support**: Process up to 60+ minutes of audio in a single request
|
||
- **🔌 Plugin Architecture**: No vLLM source code modification required - just install and run
|
||
- **⚡ Data Parallel (DP)**: Run independent model replicas across multiple GPUs with automatic load balancing behind a single port
|
||
|
||
## 🛠️ Installation
|
||
|
||
Using Official vLLM Docker Image (Recommended)
|
||
|
||
1. Clone the repository
|
||
```bash
|
||
git clone https://github.com/microsoft/VibeVoice.git
|
||
cd VibeVoice
|
||
```
|
||
|
||
2. Launch the server (background mode)
|
||
```bash
|
||
docker run -d --gpus all --name vibevoice-vllm \
|
||
--ipc=host \
|
||
-p 8000:8000 \
|
||
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
|
||
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
|
||
-v $(pwd):/app \
|
||
-w /app \
|
||
--entrypoint bash \
|
||
vllm/vllm-openai:v0.14.1 \
|
||
-c "python3 /app/vllm_plugin/scripts/start_server.py"
|
||
```
|
||
|
||
## ⚡ Multi-GPU Deployment
|
||
|
||
The launcher supports two types of GPU parallelism via `--tp` and `--dp` flags:
|
||
|
||
| Flag | Name | What it does |
|
||
|------|------|-------------|
|
||
| `--tp N` | Tensor Parallel | Splits **one model** across N GPUs (for models too large for a single GPU) |
|
||
| `--dp N` | Data Parallel | Runs **N independent replicas**, one per GPU, with automatic load balancing behind a single port |
|
||
|
||
### Data Parallel (Recommended for scaling throughput)
|
||
|
||
Run N independent replicas on N GPUs with automatic load balancing behind a single port.
|
||
When `--dp N` is specified (N > 1), the launcher automatically starts N independent vLLM
|
||
processes behind an nginx reverse proxy (2×N workers) for optimal throughput:
|
||
|
||
```bash
|
||
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
|
||
--ipc=host \
|
||
-p 8000:8000 \
|
||
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
|
||
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
|
||
-v $(pwd):/app \
|
||
-w /app \
|
||
--entrypoint bash \
|
||
vllm/vllm-openai:v0.14.1 \
|
||
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 4"
|
||
```
|
||
|
||
Run on all 8 GPUs:
|
||
|
||
```bash
|
||
docker run -d --gpus all --name vibevoice-vllm \
|
||
--ipc=host \
|
||
-p 8000:8000 \
|
||
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
|
||
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
|
||
-v $(pwd):/app \
|
||
-w /app \
|
||
--entrypoint bash \
|
||
vllm/vllm-openai:v0.14.1 \
|
||
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 8"
|
||
```
|
||
|
||
### Tensor Parallel
|
||
|
||
Split a single model across 2 GPUs (useful if GPU memory is limited):
|
||
|
||
```bash
|
||
docker run -d --gpus '"device=0,1"' --name vibevoice-vllm \
|
||
--ipc=host \
|
||
-p 8000:8000 \
|
||
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
|
||
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
|
||
-v $(pwd):/app \
|
||
-w /app \
|
||
--entrypoint bash \
|
||
vllm/vllm-openai:v0.14.1 \
|
||
-c "python3 /app/vllm_plugin/scripts/start_server.py --tp 2"
|
||
```
|
||
|
||
### Hybrid (DP × TP)
|
||
|
||
Combine both — e.g., 2 replicas, each split across 2 GPUs (4 GPUs total):
|
||
|
||
```bash
|
||
docker run -d --gpus '"device=0,1,2,3"' --name vibevoice-vllm \
|
||
--ipc=host \
|
||
-p 8000:8000 \
|
||
-v $(pwd):/app \
|
||
-w /app \
|
||
--entrypoint bash \
|
||
vllm/vllm-openai:v0.14.1 \
|
||
-c "python3 /app/vllm_plugin/scripts/start_server.py --dp 2 --tp 2"
|
||
```
|
||
|
||
> **Note**: Total GPUs required = `dp × tp`. Make sure to expose enough GPU devices in the Docker `--gpus` flag.
|
||
|
||
3. View logs
|
||
```bash
|
||
docker logs -f vibevoice-vllm
|
||
```
|
||
|
||
> **Note**:
|
||
> - The `-d` flag runs the container in background (detached mode)
|
||
> - Use `docker stop vibevoice-vllm` to stop the service
|
||
> - The model will be downloaded to HuggingFace cache (`~/.cache/huggingface`) inside the container
|
||
|
||
## 🚀 Usages
|
||
|
||
### Test the API
|
||
|
||
Once the vLLM server is running, test it with the provided script:
|
||
|
||
```bash
|
||
# Basic transcription
|
||
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav
|
||
|
||
# With hotwords for better recognition of specific terms
|
||
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
|
||
|
||
```
|
||
|
||
```bash
|
||
# With auto-recovery from repetition loops (for long audio)
|
||
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav
|
||
|
||
# Auto-recover with hotwords
|
||
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
|
||
```
|
||
|
||
> **Note**:
|
||
> - The audio/video file must be inside the mounted directory (`/app` in the container). Copy your files to the VibeVoice folder before testing.
|
||
> - Hotwords help improve recognition of domain-specific terms like proper nouns, technical terms, and speaker names.
|
||
|
||
### Environment Variables
|
||
|
||
| Variable | Description | Default |
|
||
|----------|-------------|---------|
|
||
| `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum FFmpeg processes for audio decoding | `64` |
|
||
| `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` |
|
||
|
||
|
||
|
||
## 📊 Performance Tips
|
||
|
||
1. **GPU Memory**: Use `--gpu-memory-utilization 0.9` for maximum throughput if you have dedicated GPU
|
||
2. **Batch Size**: Increase `--max-num-seqs` for higher concurrency (requires more GPU memory)
|
||
3. **FFmpeg Concurrency**: Tune `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` based on CPU cores
|
||
|
||
## 🚨 Troubleshooting
|
||
|
||
### Common Issues
|
||
|
||
1. **"CUDA out of memory"**
|
||
- Reduce `--gpu-memory-utilization`
|
||
- Reduce `--max-num-seqs`
|
||
- Use smaller `--max-model-len`
|
||
|
||
2. **"Audio decoding failed"**
|
||
- Ensure FFmpeg is installed: `ffmpeg -version`
|
||
- Check audio file format is supported
|
||
|
||
3. **"Model not found"**
|
||
- Ensure model path contains `config.json` and model weights
|
||
- Generate tokenizer files if missing
|
||
|
||
4. **"Plugin not loaded"**
|
||
- Verify installation: `pip show vibevoice`
|
||
- Check entry point: `pip show -f vibevoice | grep entry`
|
||
|
||
|