3.6 KiB
3.6 KiB
VibeVoice vLLM ASR Deployment
Deploy VibeVoice ASR model as a high-performance API service using vLLM. This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
🔥 Key Features
- 🚀 High-Performance Serving: Optimized for high-throughput ASR inference with vLLM's continuous batching
- 📡 OpenAI-Compatible API: Standard
/v1/chat/completionsendpoint with streaming support - 🎵 Long Audio Support: Process up to 60+ minutes of audio in a single request
- 🔌 Plugin Architecture: No vLLM source code modification required - just install and run
🛠️ Installation
Using Official vLLM Docker Image (Recommended)
# 1. Pull the official vLLM image
docker pull vllm/vllm-openai:latest
# 2. Start an interactive container
docker run -it --gpus all --name vibevoice-vllm \
--ipc=host \
-p 8000:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v /path/to/models:/models \
-v /path/to/VibeVoice:/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:latest
# 3. Inside container: Install system dependencies
bash vllm_plugin/scripts/install_deps.sh
# 4. Inside container: Install VibeVoice with vLLM support
pip install -e .[vllm]
# 5. Inside container: (Optional) Generate tokenizer files if needed
python3 -m vllm_plugin.tools.generate_tokenizer_files --output /models/your_model
# 6. Inside container: Start vLLM server
vllm serve /models/your_model \
--served-model-name vibevoice \
--trust-remote-code \
--dtype bfloat16 \
--max-num-seqs 64 \
--max-model-len 65536 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.8 \
--enforce-eager \
--no-enable-prefix-caching \
--enable-chunked-prefill \
--chat-template-content-format openai \
--tensor-parallel-size 1 \
--allowed-local-media-path /app \
--port 8000
Note
: This approach allows you to switch models, adjust parameters, and debug issues without rebuilding the container.
🚀 Quick Start
Test the API
Once the vLLM server is running, test it with the provided script:
# Run the test script (inside container)
python3 vllm_plugin/tests/test_api.py /path/to/audio.wav
Environment Variables
| Variable | Description | Default |
|---|---|---|
VIBEVOICE_FFMPEG_MAX_CONCURRENCY |
Maximum FFmpeg processes for audio decoding | 64 |
PYTORCH_CUDA_ALLOC_CONF |
CUDA memory allocator config | expandable_segments:True |
📊 Performance Tips
- GPU Memory: Use
--gpu-memory-utilization 0.9for maximum throughput if you have dedicated GPU - Batch Size: Increase
--max-num-seqsfor higher concurrency (requires more GPU memory) - FFmpeg Concurrency: Tune
VIBEVOICE_FFMPEG_MAX_CONCURRENCYbased on CPU cores
🚨 Troubleshooting
Common Issues
-
"CUDA out of memory"
- Reduce
--gpu-memory-utilization - Reduce
--max-num-seqs - Use smaller
--max-model-len
- Reduce
-
"Audio decoding failed"
- Ensure FFmpeg is installed:
ffmpeg -version - Check audio file format is supported
- Ensure FFmpeg is installed:
-
"Model not found"
- Ensure model path contains
config.jsonand model weights - Generate tokenizer files if missing
- Ensure model path contains
-
"Plugin not loaded"
- Verify installation:
pip show vibevoice - Check entry point:
pip show -f vibevoice | grep entry
- Verify installation: