# VibeVoice ASR LoRA Fine-tuning

This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

## Requirements

```bash
# you need to install vibevoice first
# pip install -e .[asr]

pip install peft
```

## Toy Dataset

> **Note**: The `toy_dataset/` included in this directory contains **synthetic audio generated by VibeVoice TTS** for demonstration purposes only. It is NOT a full fine-tuning dataset.
>
> When using your own data, you should:
> - Prepare real audio recordings with accurate transcriptions
> - Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
> - Consider the audio quality and speaker diversity in your data

## Data Format

Training data should be organized as pairs of audio files and JSON labels in the same directory:

```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```

### JSON Label Format

Each JSON file should have the following structure:

```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```

The `customized_context` field is optional. It holds domain-specific terms or context sentences.

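Before training on your own data, it can help to check that every label file matches the structure above. The following is a minimal sketch, not part of the training scripts: the function names `validate_label` and `validate_dataset` are hypothetical, and the rules it enforces (required keys, `start` before `end`, segments ending within `audio_duration`) are assumptions inferred from the example, not requirements confirmed by `lora_finetune.py`.

```python
import json
from pathlib import Path

# Assumed from the example label: keys expected in every file and every segment.
REQUIRED_TOP_LEVEL = ("audio_duration", "audio_path", "segments")
REQUIRED_SEGMENT = ("speaker", "text", "start", "end")


def validate_label(label: dict) -> list[str]:
    """Return a list of human-readable problems found in one label dict."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in label:
            problems.append(f"missing top-level key: {key}")
    duration = label.get("audio_duration")
    for i, seg in enumerate(label.get("segments", [])):
        missing = [k for k in REQUIRED_SEGMENT if k not in seg]
        if missing:
            problems.append(f"segment {i}: missing {missing}")
            continue
        if seg["start"] >= seg["end"]:
            problems.append(f"segment {i}: start >= end")
        if duration is not None and seg["end"] > duration:
            problems.append(f"segment {i}: end exceeds audio_duration")
    # customized_context is optional, but when present it should be a list.
    if not isinstance(label.get("customized_context", []), list):
        problems.append("customized_context should be a list of strings")
    return problems


def validate_dataset(data_dir: str) -> dict[str, list[str]]:
    """Check every *.json label in data_dir; return {file name: problems}."""
    root = Path(data_dir)
    report = {}
    for label_path in sorted(root.glob("*.json")):
        label = json.loads(label_path.read_text(encoding="utf-8"))
        problems = validate_label(label)
        audio_path = label.get("audio_path")
        # audio_path is resolved relative to the label's directory (an assumption).
        if audio_path and not (root / audio_path).is_file():
            problems.append(f"audio file not found: {audio_path}")
        if problems:
            report[label_path.name] = problems
    return report
```

Calling `validate_dataset("./toy_dataset")` returns an empty dict when no problems are found, and otherwise maps each offending label file to its list of issues.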
## Training

### Basic

```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

# Specific GPUs (e.g., GPU 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```

### Full Options

The script uses Hugging Face's `TrainingArguments`, so all standard `TrainingArguments` options are available:

```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```

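When several GPUs and gradient accumulation are combined, as in the command above, the number of samples behind each optimizer step is easy to misjudge. The sketch below shows the arithmetic, assuming ordinary data-parallel training where each `torchrun` process handles its own batch. The helper name `effective_batch_size` is hypothetical and not part of the training script.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int = 1,
                         num_processes: int = 1) -> int:
    """Samples contributing to each optimizer step under data-parallel training."""
    return per_device_batch_size * gradient_accumulation_steps * num_processes


# Values from the command above:
# --per_device_train_batch_size 1, --gradient_accumulation_steps 4, --nproc_per_node=4
print(effective_batch_size(1, 4, 4))  # 16
```

If you change the number of GPUs, adjusting `--gradient_accumulation_steps` to keep this product constant is one way to keep runs comparable.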
### Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Gradient accumulation steps (effective batch size = per-device batch size × accumulation steps × number of GPUs) |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include `customized_context` from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this many seconds |

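To make the trade-off behind `--lora_r` and `--lora_alpha` concrete, the sketch below counts the parameters that standard LoRA adds to one linear layer and computes the scaling applied to the update. It assumes the usual formulation, where a layer with weight of shape `out_features × in_features` gains two matrices of shapes `r × in_features` and `out_features × r`, scaled by `lora_alpha / r`. The 4096 × 4096 layer size is a hypothetical example, not a VibeVoice dimension, and the function names are illustrative.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters standard LoRA adds to one linear layer."""
    # A has shape (r, in_features); B has shape (out_features, r).
    return r * (in_features + out_features)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor multiplying the low-rank update in standard LoRA."""
    return lora_alpha / r


d_in = d_out = 4096  # hypothetical layer size, not taken from VibeVoice
added = lora_param_count(d_in, d_out, r=16)
full = d_in * d_out

print(added)                       # 131072
print(f"{added / full:.2%}")       # 0.78%
print(lora_scaling(32, 16))        # 2.0
```

Doubling the rank doubles the added parameters, and keeping `lora_alpha` at twice the rank keeps the scaling factor at 2.0.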
## Inference with Fine-tuned Model

```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```

## Merging LoRA Weights (Optional)

To merge the LoRA weights into the base model for faster inference:

```python
from peft import PeftModel

# VibeVoiceASRForConditionalGeneration must also be imported from the vibevoice package.

# Load base model + LoRA
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
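
As an illustration of why merging does not change the model's outputs, the sketch below folds a low-rank update into a weight matrix and compares it with applying the adapter separately. It is a conceptual example in plain Python under the standard LoRA formulation (`W + (alpha / r) * B @ A`), not the PEFT implementation. The tiny matrices and the helper names `matmul`, `merge_lora`, and `linear` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (alpha / r) * B @ A, the weight a merged layer would use."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def linear(weight, x):
    """Compute y = W x for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Tiny hypothetical layer: 2 outputs, 3 inputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]   # shape (r, in_features)
B = [[2.0], [1.0]]       # shape (out_features, r)
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Adapter path: base output plus the scaled low-rank update.
base = linear(W, x)
low_rank = linear(B, linear(A, x))
adapter_out = [b + (alpha / r) * u for b, u in zip(base, low_rank)]

# Merged path: a single matrix multiply with the folded weight.
merged_out = linear(merge_lora(W, A, B, alpha, r), x)

print(adapter_out, merged_out)  # [17.0, 4.0] [17.0, 4.0]
```

Both paths give the same result, but the merged layer needs only one matrix multiply per forward pass, which is where the inference speedup comes from.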