This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.
Before running evaluation, ensure you have:
- Installed the SAM-Audio package and its dependencies
- Authenticated with Hugging Face to access the model checkpoints (see main README)
Run evaluation on the default setting (`instr-pro`):

```shell
python main.py
```

You can also use multiple GPUs to speed up evaluation:

```shell
torchrun --nproc_per_node=<ngpus> main.py
```

Evaluate on a specific setting:

```shell
python main.py --setting sfx
```

Evaluate on multiple settings:

```shell
python main.py --setting sfx speech music
```

Run `python main.py --help` to see all available settings.
```shell
python main.py [OPTIONS]
```

- `-s, --setting`: Which setting(s) to evaluate (default: `instr-pro`). Choices: see available settings above. Multiple settings can be specified, e.g. `--setting sfx speech music`
- `--cache-path`: Where to cache downloaded datasets (default: `~/.cache/sam_audio`)
- `-p, --checkpoint-path`: Model checkpoint to evaluate (default: `facebook/sam-audio-1b`). Can be a local path or a Hugging Face model ID
- `-b, --batch-size`: Batch size for evaluation (default: `1`)
- `-w, --num-workers`: Number of data loading workers (default: `4`)
- `-c, --candidates`: Number of reranking candidates (default: `8`)
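The options above map naturally onto Python's `argparse`. The following is a hedged sketch of how such a CLI could be declared, not the actual `main.py` implementation; the `build_parser` helper and its exact argument declarations are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI documented above; the real
    # main.py may declare these flags differently.
    parser = argparse.ArgumentParser(description="SAM-Audio evaluation")
    parser.add_argument("-s", "--setting", nargs="+", default=["instr-pro"],
                        help="Which setting(s) to evaluate")
    parser.add_argument("--cache-path", default="~/.cache/sam_audio",
                        help="Where to cache downloaded datasets")
    parser.add_argument("-p", "--checkpoint-path", default="facebook/sam-audio-1b",
                        help="Local path or Hugging Face model ID")
    parser.add_argument("-b", "--batch-size", type=int, default=1)
    parser.add_argument("-w", "--num-workers", type=int, default=4)
    parser.add_argument("-c", "--candidates", type=int, default=8)
    return parser

# nargs="+" is what lets a single --setting flag accept several values.
args = build_parser().parse_args(["--setting", "sfx", "speech"])
print(args.setting)  # ['sfx', 'speech']
```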
The evaluation framework computes the following metrics:
- Judge - SAM Audio Judge quality assessment metric
- Aesthetic - Aesthetic quality metric
- CLAP - Audio-text alignment metric (CLAP similarity)
- ImageBind - Audio-video alignment metric (for visual settings only)
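CLAP scores audio-text alignment as the cosine similarity between the audio and text embeddings produced by its two encoders. A minimal NumPy sketch of that computation; the embeddings below are small stand-ins, not real CLAP outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: the dot product of the two vectors after
    # L2-normalizing each of them.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Stand-in embeddings; real CLAP embeddings come from its audio and
# text encoders and are much higher-dimensional.
audio_emb = np.array([0.2, 0.5, 0.1, 0.8])
text_emb = np.array([0.1, 0.6, 0.2, 0.7])
print(cosine_similarity(audio_emb, text_emb))
```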
Results are saved to the `results/` directory as JSON files, one per setting:

```
results/
├── sfx.json
├── speech.json
└── music.json
```
Each JSON file contains the averaged metric scores across all samples in that setting.
Example output:

```json
{
  "JudgeOverall": "4.386",
  "JudgeFaithfulness": "4.708",
  "JudgeRecall": "4.934",
  "JudgePrecision": "4.451",
  "ContentEnjoyment": "5.296",
  "ContentUsefulness": "6.903",
  "ProductionComplexity": "4.301",
  "ProductionQuality": "7.100",
  "CLAPSimilarity": "0.271"
}
```
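To compare settings side by side, the per-setting JSON files can be loaded and tabulated. A minimal sketch, assuming the `results/` layout and string-valued scores shown above; the helper names here are hypothetical:

```python
import json
from pathlib import Path

def load_results(results_dir: str = "results") -> dict[str, dict[str, float]]:
    # Read every per-setting JSON file (e.g. sfx.json, speech.json) and
    # coerce the string-valued scores to floats for numeric comparison.
    results = {}
    for path in Path(results_dir).glob("*.json"):
        with open(path) as f:
            scores = json.load(f)
        results[path.stem] = {k: float(v) for k, v in scores.items()}
    return results

def print_metric(results: dict[str, dict[str, float]], metric: str) -> None:
    # Print one metric (e.g. "JudgeOverall") for each evaluated setting.
    for setting, scores in sorted(results.items()):
        if metric in scores:
            print(f"{setting:10s} {metric}: {scores[metric]:.3f}")
```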