Evaluation

This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.

Setup

Before running evaluation, ensure you have:

  1. Installed the SAM-Audio package and its dependencies
  2. Authenticated with Hugging Face to access the model checkpoints (see main README)
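For example, the setup could look like the following from a local clone of the repository (a minimal sketch: the editable install assumes you are working from a source checkout, and huggingface-cli must be available in your environment):

# Install the SAM-Audio package and its dependencies from the repository root
pip install -e .

# Log in to Hugging Face so the gated model checkpoints can be downloaded
huggingface-cli login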

Quick Start

Run evaluation on the default setting (instr-pro):

python main.py

You can also use multiple GPUs to speed up evaluation:

torchrun --nproc_per_node=<ngpus> main.py

Evaluate on a specific setting:

python main.py --setting sfx

Evaluate on multiple settings:

python main.py --setting sfx speech music

Available Evaluation Settings

Run python main.py --help to list all available settings.

Command Line Options

python main.py [OPTIONS]

Options:

  • -s, --setting - Which setting(s) to evaluate (default: instr-pro)

    • Choices: See available settings above
    • Can specify multiple settings: --setting sfx speech music
  • --cache-path - Where to cache downloaded datasets (default: ~/.cache/sam_audio)

  • -p, --checkpoint-path - Model checkpoint to evaluate (default: facebook/sam-audio-1b)

    • Can use local path or Hugging Face model ID
  • -b, --batch-size - Batch size for evaluation (default: 1)

  • -w, --num-workers - Number of data loading workers (default: 4)

  • -c, --candidates - Number of reranking candidates (default: 8)
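For example, these options can be combined in a single invocation (the flag values below are illustrative, not recommended settings):

python main.py --setting sfx speech music --checkpoint-path facebook/sam-audio-1b --batch-size 4 --num-workers 4 --candidates 8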

Evaluation Metrics

The evaluation framework computes the following metrics:

  • Judge - SAM Audio Judge quality assessment metric
  • Aesthetic - Aesthetic quality metric
  • CLAP - Audio-text alignment metric (CLAP similarity)
  • ImageBind - Audio-video alignment metric (for visual settings only)

Output

Results are saved to the results/ directory as JSON files, one per setting:

results/
├── sfx.json
├── speech.json
└── music.json

Each JSON file contains the averaged metric scores across all samples in that setting.

Example output:

{
    "JudgeOverall": "4.386",
    "JudgeFaithfulness": "4.708",
    "JudgeRecall": "4.934",
    "JudgePrecision": "4.451",
    "ContentEnjoyment": "5.296",
    "ContentUsefulness": "6.903",
    "ProductionComplexity": "4.301",
    "ProductionQuality": "7.100",
    "CLAPSimilarity": "0.271"
}
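
To compare settings side by side, the per-setting JSON files can be loaded with a short script such as the one below (a minimal sketch assuming the default results/ layout shown above):

import json
from pathlib import Path

# Collect the averaged metric scores from each per-setting results file
for result_file in sorted(Path("results").glob("*.json")):
    with result_file.open() as f:
        scores = json.load(f)
    print(result_file.stem, scores)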