# VibeVoice ASR LoRA Fine-tuning
This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.
## Requirements
```bash
# Install vibevoice first (editable install from the repository root)
pip install -e .
pip install peft
```
## Toy Dataset
> **Note**: The `toy_dataset/` included in this directory contains **synthetic audio generated by VibeVoice TTS** for demonstration purposes only. It is NOT a full fine-tuning dataset.
>
> When using your own data, you should:
> - Prepare real audio recordings with accurate transcriptions
> - Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
> - Consider the audio quality and speaker diversity in your data
## Data Format
Training data should be organized as pairs of audio files and JSON labels in the same directory:
```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```
### JSON Label Format
Each JSON file should have the following structure. The `customized_context` field is optional and holds domain-specific terms or context sentences:
```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```
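Before training on your own data, it can help to check the labels against the layout described above. The following is a minimal sketch, not part of the provided scripts: the helper names `validate_label` and `validate_dataset` are hypothetical, and it assumes that `audio_duration`, `audio_path`, and `segments` are required, that `audio_path` is relative to the data directory, and that segment times are in seconds.

```python
import json
from pathlib import Path


def validate_label(label: dict) -> list[str]:
    """Return a list of problems found in one label dict (empty list = OK)."""
    problems = []
    # Assumption: these top-level keys are required by the training script.
    for key in ("audio_duration", "audio_path", "segments"):
        if key not in label:
            problems.append(f"missing key: {key}")
    duration = label.get("audio_duration")
    for i, seg in enumerate(label.get("segments", [])):
        missing = [k for k in ("speaker", "text", "start", "end") if k not in seg]
        if missing:
            problems.append(f"segment {i}: missing {missing}")
            continue
        if seg["start"] >= seg["end"]:
            problems.append(f"segment {i}: start >= end")
        if duration is not None and seg["end"] > duration:
            problems.append(f"segment {i}: end exceeds audio_duration")
    return problems


def validate_dataset(data_dir: str) -> dict[str, list[str]]:
    """Check every *.json label in data_dir; map file name -> problems."""
    report = {}
    root = Path(data_dir)
    for label_path in sorted(root.glob("*.json")):
        label = json.loads(label_path.read_text(encoding="utf-8"))
        problems = validate_label(label)
        # Assumption: audio_path is resolved relative to the data directory.
        audio_path = label.get("audio_path")
        if audio_path and not (root / audio_path).is_file():
            problems.append(f"audio file not found: {audio_path}")
        if problems:
            report[label_path.name] = problems
    return report
```

Calling `validate_dataset("./toy_dataset")` returns an empty dict when every label passes these checks.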
## Training
### Basic
```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

# Specific GPUs (e.g., GPUs 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```
### Full Options
The script uses HuggingFace's `TrainingArguments`, so all standard options are available:
```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```
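With multiple GPUs and gradient accumulation, the batch size seen by each optimizer update is larger than `--per_device_train_batch_size`. The sketch below works through the arithmetic for the command above (4 processes, batch size 1, accumulation 4). The function names and the 1000-example dataset size are hypothetical, and the step count is an approximation; the exact number the `Trainer` reports can differ by rounding.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Examples contributing to one optimizer update across all GPUs."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def approx_steps_per_epoch(num_examples: int,
                           per_device_batch_size: int,
                           gradient_accumulation_steps: int,
                           num_gpus: int) -> int:
    """Rough number of optimizer updates needed to see every example once."""
    batch = effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_gpus
    )
    return math.ceil(num_examples / batch)


# Values from the command above: batch size 1, accumulation 4, 4 GPUs.
print(effective_batch_size(1, 4, 4))

# Hypothetical dataset of 1000 examples.
print(approx_steps_per_epoch(1000, 1, 4, 4))
```

Since `--logging_steps` and `--save_steps` count optimizer updates, this estimate also indicates roughly how often logs and checkpoints appear per epoch.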
### Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Steps to accumulate before each optimizer update (effective batch size = per-device batch size × grad_accum × number of GPUs) |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include customized_context from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this (seconds) |
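
To make the effect of `--lora_r` and `--lora_alpha` concrete, here is a small arithmetic sketch. It assumes the standard LoRA formulation, in which a linear layer of shape `d_out × d_in` gains two matrices of shapes `d_out × r` and `r × d_in`, and the adapter update is scaled by `lora_alpha / r`. The 4096 × 4096 layer size is hypothetical and is not taken from the VibeVoice ASR architecture.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the output."""
    return lora_alpha / r


# Hypothetical 4096 x 4096 linear layer.
d_in = d_out = 4096
full = d_in * d_out

for r in (8, 16, 32):
    added = lora_trainable_params(d_in, d_out, r)
    print(f"r={r}: {added} adapter params ({added / full:.2%} of the full layer)")

# With the defaults from the table (alpha=32, r=16) the scaling factor is 2.
print(lora_scaling(32, 16))
```

Doubling the rank doubles the adapter size. Keeping `lora_alpha` at twice the rank holds the scaling factor constant as the rank changes.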
## Inference with Fine-tuned Model
```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```
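If the context terms already live in a label file, the command above can be assembled from it. This is only a sketch: `build_inference_args` is a hypothetical helper, and joining the `customized_context` entries with `", "` is an assumption based on the example string above, not a documented input format of `inference_lora.py`.

```python
import os
import shlex


def build_inference_args(label: dict,
                         audio_dir: str,
                         base_model: str = "microsoft/VibeVoice-ASR",
                         lora_path: str = "./output") -> list[str]:
    """Build the argv list for inference_lora.py from one label dict."""
    cmd = [
        "python", "inference_lora.py",
        "--base_model", base_model,
        "--lora_path", lora_path,
        "--audio_file", os.path.join(audio_dir, label["audio_path"]),
    ]
    # Assumption: context terms are passed as one comma-separated string.
    context = label.get("customized_context") or []
    if context:
        cmd += ["--context_info", ", ".join(context)]
    return cmd


# Inline label mirroring the fields used in the JSON label format.
label = {"audio_path": "0.mp3", "customized_context": ["Tea Brew", "Aiden Host"]}
print(shlex.join(build_inference_args(label, "./toy_dataset")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with terms that contain spaces.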
## Merging LoRA Weights (Optional)
To merge LoRA weights into the base model for faster inference:
```python
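from peft import PeftModel

# VibeVoiceASRForConditionalGeneration is provided by the vibevoice package;
# import it from the module that defines it in your installed version.

# Load the base model, then attach the LoRA adapter saved by training
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")

# Fold the adapter into the base weights, drop the PEFT wrapper, and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")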
```
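
Merging can speed up inference because the adapter's extra matrix products are folded into the base weights once, instead of being computed on every forward pass. The toy example below illustrates that arithmetic in plain Python, assuming the standard LoRA update `W + (lora_alpha / r) * B @ A`. It is an illustration of the idea, not PEFT's implementation, and all matrices and helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) as a new matrix."""
    scale = lora_alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


# Toy 2x2 layer with a rank-1 adapter (all values hypothetical).
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, d_out x d_in
A = [[0.5, -0.5]]              # r x d_in
B = [[1.0], [2.0]]             # d_out x r
x = [[3.0], [1.0]]             # input column vector
r, lora_alpha = 1, 2

# Unmerged: base path plus scaled adapter path (two extra matmuls per call).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (lora_alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged: a single matmul with the folded weight.
merged = matmul(merge_lora(W, A, B, lora_alpha, r), x)

assert merged == unmerged
print(merged)
```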