3.8 KiB
3.8 KiB
VibeVoice vLLM ASR Deployment
Deploy VibeVoice ASR model as a high-performance API service using vLLM. This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
🔥 Key Features
- 🚀 High-Performance Serving: Optimized for high-throughput ASR inference with vLLM's continuous batching
- 📡 OpenAI-Compatible API: Standard
/v1/chat/completionsendpoint with streaming support - 🎵 Long Audio Support: Process up to 60+ minutes of audio in a single request
- 🔌 Plugin Architecture: No vLLM source code modification required - just install and run
🛠️ Installation
Using Official vLLM Docker Image (Recommended)
- Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
- Launch the server (background mode)
docker run -d --gpus all --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py"
- View logs
docker logs -f vibevoice-vllm
Note
:
- The
-dflag runs the container in background (detached mode)- Use
docker stop vibevoice-vllmto stop the service- The model will be downloaded to HuggingFace cache (
~/.cache/huggingface) inside the container
🚀 Usages
Test the API
Once the vLLM server is running, test it with the provided script:
# Basic transcription
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav
# With hotwords for better recognition of specific terms
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
# With auto-recovery from repetition loops (for long audio)
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav
# Auto-recover with hotwords
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
Note
:
- The audio/video file must be inside the mounted directory (
/appin the container). Copy your files to the VibeVoice folder before testing.- Hotwords help improve recognition of domain-specific terms like proper nouns, technical terms, and speaker names.
Environment Variables
| Variable | Description | Default |
|---|---|---|
VIBEVOICE_FFMPEG_MAX_CONCURRENCY |
Maximum FFmpeg processes for audio decoding | 64 |
PYTORCH_ALLOC_CONF |
PyTorch memory allocator config | expandable_segments:True |
📊 Performance Tips
- GPU Memory: Use
--gpu-memory-utilization 0.9for maximum throughput if you have dedicated GPU - Batch Size: Increase
--max-num-seqsfor higher concurrency (requires more GPU memory) - FFmpeg Concurrency: Tune
VIBEVOICE_FFMPEG_MAX_CONCURRENCYbased on CPU cores
🚨 Troubleshooting
Common Issues
-
"CUDA out of memory"
- Reduce
--gpu-memory-utilization - Reduce
--max-num-seqs - Use smaller
--max-model-len
- Reduce
-
"Audio decoding failed"
- Ensure FFmpeg is installed:
ffmpeg -version - Check audio file format is supported
- Ensure FFmpeg is installed:
-
"Model not found"
- Ensure model path contains
config.jsonand model weights - Generate tokenizer files if missing
- Ensure model path contains
-
"Plugin not loaded"
- Verify installation:
pip show vibevoice - Check entry point:
pip show -f vibevoice | grep entry
- Verify installation: