# VibeVoice ASR LoRA Fine-tuning

This directory contains scripts for LoRA (Low-Rank Adaptation) fine-tuning of the VibeVoice ASR model.

## Requirements

```bash
# Install VibeVoice with the ASR extras first:
# pip install -e .[asr]

pip install peft
```

## Toy Dataset

> **Note**: The `toy_dataset/` included in this directory contains **synthetic audio generated by VibeVoice TTS** for demonstration purposes only. It is NOT a complete fine-tuning dataset.
>
> When using your own data, you should:
> - Prepare real audio recordings with accurate transcriptions
> - Adjust hyperparameters (learning rate, epochs, LoRA rank) based on your dataset size and domain
> - Consider the audio quality and speaker diversity in your data

## Data Format

Training data should be organized as pairs of audio files and JSON labels in the same directory:

```
toy_dataset/
├── 0.mp3
├── 0.json
├── 1.mp3
├── 1.json
└── ...
```
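Before training, it can help to confirm that every audio file has a label next to it. The following is a minimal sketch, not part of the training scripts: `find_pairs` is a hypothetical helper, the extension list is an assumption, and it assumes each label shares its audio file's name stem, as in the listing above.

```python
from pathlib import Path

# Assumption: which extensions count as audio. The toy dataset uses .mp3.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def find_pairs(data_dir):
    """Split audio files into those with a same-stem JSON label and those without.

    Returns (paired, missing), where paired is a list of
    (audio_name, label_name) tuples and missing is a list of audio names.
    """
    paired, missing = [], []
    for audio in sorted(Path(data_dir).iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # skip labels and unrelated files
        label = audio.with_suffix(".json")
        if label.is_file():
            paired.append((audio.name, label.name))
        else:
            missing.append(audio.name)
    return paired, missing
```

Calling `find_pairs("./toy_dataset")` and checking that the second list is empty is a quick sanity check before launching a run.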

### JSON Label Format

Each JSON file should have the following structure:

```json
{
  "audio_duration": 351.73,
  "audio_path": "0.mp3",
  "segments": [
    {
      "speaker": 0,
      "text": "Hey everyone, welcome back...",
      "start": 0.0,
      "end": 38.68
    },
    {
      "speaker": 1,
      "text": "Thanks for having me...",
      "start": 38.75,
      "end": 77.88
    }
  ],
  "customized_context": ["Tea Brew", "Aiden Host", "The property is near Meter Street."]
}
```

The `customized_context` field is optional. It holds domain-specific terms or context sentences.
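A label file can be checked against this structure before training. The sketch below is illustrative only: `validate_label` is a hypothetical helper, and the timing rules it enforces (`start` before `end`, `end` within `audio_duration`) are assumptions about what a sensible label looks like, not checks confirmed to exist in `lora_finetune.py`.

```python
import json


def validate_label(label):
    """Return a list of problems found in one parsed label dict (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("audio_duration", "audio_path", "segments")
        if key not in label
    ]
    if problems:
        return problems

    duration = label["audio_duration"]
    for i, seg in enumerate(label["segments"]):
        absent = [k for k in ("speaker", "text", "start", "end") if k not in seg]
        if absent:
            problems.append(f"segment {i}: missing {absent}")
            continue
        # Assumed rule: timestamps are non-negative and ordered.
        if not (0 <= seg["start"] < seg["end"]):
            problems.append(f"segment {i}: start must be >= 0 and < end")
        # Assumed rule: a segment cannot run past the end of the audio.
        if seg["end"] > duration:
            problems.append(f"segment {i}: end exceeds audio_duration")

    # customized_context is optional; when present it should be a list of strings.
    context = label.get("customized_context", [])
    if not isinstance(context, list) or not all(isinstance(c, str) for c in context):
        problems.append("customized_context must be a list of strings")
    return problems


def validate_label_file(path):
    """Load a label file and validate it."""
    with open(path, encoding="utf-8") as f:
        return validate_label(json.load(f))
```

For example, `validate_label_file("./toy_dataset/0.json")` returns an empty list when the file passes these checks.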

## Training

### Basic

```bash
# 1 GPU
torchrun --nproc_per_node=1 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none

# Specific GPUs (e.g., GPU 0,1,2,3)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-4 \
    --bf16 \
    --report_to none
```

### Full Options

The script uses Hugging Face's `TrainingArguments`, so all of its standard options are available:

```bash
torchrun --nproc_per_node=4 lora_finetune.py \
    --model_path microsoft/VibeVoice-ASR \
    --data_dir ./toy_dataset \
    --output_dir ./output \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --warmup_ratio 0.1 \
    --weight_decay 0.01 \
    --max_grad_norm 1.0 \
    --logging_steps 10 \
    --save_steps 100 \
    --gradient_checkpointing \
    --bf16 \
    --report_to none
```

### Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--lora_r` | 16 | LoRA rank (lower = fewer params, higher = more expressive) |
| `--lora_alpha` | 32 | LoRA scaling factor (typically 2x rank) |
| `--lora_dropout` | 0.05 | Dropout for LoRA layers |
| `--per_device_train_batch_size` | 8 | Batch size per device |
| `--gradient_accumulation_steps` | 1 | Steps to accumulate before each optimizer update (effective batch size = per-device batch size × accumulation steps × number of GPUs) |
| `--learning_rate` | 5e-5 | Learning rate (1e-4 to 2e-4 typical for LoRA) |
| `--gradient_checkpointing` | False | Enable to reduce memory usage |
| `--use_customized_context` | True | Include customized_context from JSON as additional context |
| `--max_audio_length` | None | Skip audio longer than this (seconds) |
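
Two of these settings interact arithmetically, which is easy to check by hand. The sketch below uses hypothetical helper names and assumes the standard conventions: data-parallel training multiplies the batch across GPUs, and LoRA scales its update by `lora_alpha / lora_r` (the default PEFT behavior without rank-stabilized scaling).

```python
def effective_batch_size(per_device_batch_size, grad_accum_steps, num_gpus):
    """Samples contributing to each optimizer step under data-parallel training."""
    return per_device_batch_size * grad_accum_steps * num_gpus


def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


# Values from the "Full Options" command: batch size 1, 4 accumulation steps, 4 GPUs.
print(effective_batch_size(1, 4, 4))  # 16
print(lora_scaling(32, 16))           # 2.0
```

Keeping `lora_alpha` at twice the rank holds the scaling factor at 2.0 when you change `--lora_r`.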

## Inference with Fine-tuned Model

```bash
python inference_lora.py \
    --base_model microsoft/VibeVoice-ASR \
    --lora_path ./output \
    --audio_file ./toy_dataset/0.mp3 \
    --context_info "Tea Brew, Aiden Host"
```
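
If a label file already lists `customized_context` entries, the same terms can be reused at inference time. This is a rough sketch: `build_inference_command` is a hypothetical helper, and joining the entries with a comma and a space is an assumption based on the example value above, not a documented format for `--context_info`.

```python
import json
import shlex


def build_inference_command(label_json, audio_file, lora_path="./output"):
    """Build an inference_lora.py command line from a label's JSON text."""
    label = json.loads(label_json)
    terms = label.get("customized_context", [])
    cmd = [
        "python", "inference_lora.py",
        "--base_model", "microsoft/VibeVoice-ASR",
        "--lora_path", lora_path,
        "--audio_file", audio_file,
    ]
    if terms:
        # Assumed format: comma-separated terms in a single argument.
        cmd += ["--context_info", ", ".join(terms)]
    # shlex.join quotes arguments that contain spaces.
    return shlex.join(cmd)
```

Printing the result gives a command you can paste into a shell.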

## Merging LoRA Weights (Optional)

To merge LoRA weights into the base model for faster inference:

```python
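from peft import PeftModel
# VibeVoiceASRForConditionalGeneration comes from the vibevoice package;
# import it from the module that defines it in your installation.

# Load the base model, then attach the LoRA adapter
model = VibeVoiceASRForConditionalGeneration.from_pretrained("microsoft/VibeVoice-ASR", ...)
model = PeftModel.from_pretrained(model, "./output")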

# Merge and save
model = model.merge_and_unload()
model.save_pretrained("./merged_model")
```
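
To see why merging removes the adapter overhead, here is a small numeric sketch of the standard LoRA merge, W' = W + (alpha / r) · B · A, in plain Python. It illustrates the arithmetic only and is not how PEFT implements `merge_and_unload()`; the helper names and toy matrices are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example with d_out = d_in = 2 and rank r = 1.
W = [[1, 0], [0, 1]]
A = [[1, 2]]    # shape (r, d_in)
B = [[1], [3]]  # shape (d_out, r)
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[3.0, 4.0], [6.0, 13.0]]
```

After merging, a forward pass needs one matrix multiply per layer instead of the base multiply plus the two adapter multiplies, which is where the inference speedup comes from.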