This repository contains code for audio watermarking experiments for our CS224S final project. We will update the repository and README as needed so that our partner Sanas can easily use our scripts and see the data, but all code for the class was uploaded by the deadline.
- Install Conda if you haven't already:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
- Create and activate the conda environment:
conda env create -f environment.yml
conda activate audio-marking
- Install FFmpeg (required by AudioCraft):
conda install "ffmpeg<5" -c conda-forge
- Clone AudioCraft repository (for AudioSeal training):
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
cd ..
- environment.yml: Conda environment configuration
- .gitignore: Git ignore rules
- README.md: This file
- utils/dataset.py: Script to download and validate gigaspeech dataset
- utils/encode.py: Script to add watermark to audio files
- utils/decode.py: Script to detect watermark in audio files
- utils/spectogram.py: Script to generate and save spectrograms of audio files
- utils/analyze.py: Script to check audio file metadata (sample rate, channels, duration, format)
To check the metadata of an audio file (sample rate, channels, duration, and format):
python utils/analyze.py path/to/audio.wav
The script will display:
- Sample rate in Hz
- Number of channels
- Duration in seconds
- Audio file format
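As a rough illustration of the fields listed above, the sketch below reads the same metadata from a WAV file using only Python's standard-library `wave` module. It is not the implementation of utils/analyze.py: the helper name `wav_metadata` and the dictionary keys are hypothetical, and the sketch only handles uncompressed PCM WAV files.

```python
import wave


def wav_metadata(path):
    """Return basic metadata for a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        return {
            "sample_rate_hz": sample_rate,
            "channels": wav.getnchannels(),
            # Duration is the frame count divided by frames per second.
            "duration_s": n_frames / sample_rate,
            "format": "WAV (PCM, %d-bit)" % (8 * wav.getsampwidth()),
        }
```

Calling `wav_metadata("path/to/audio.wav")` returns a dictionary that can be printed field by field.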
To add a watermark to an audio file:
python utils/encode.py --input_path path/to/audio.wav --sample_rate 16000
Optional: Add a 16-bit message to the watermark:
python utils/encode.py --input_path path/to/audio.wav --message "1010101010101010"
Optional: Specify custom output path:
python utils/encode.py --input_path path/to/audio.wav --output_path path/to/output.wav
Optional: Use a custom model (default: "audioseal_wm_16bits"):
python utils/encode.py --input_path path/to/audio.wav --model_path path/to/custom/generator_model.pth
To detect watermark in an audio file:
python utils/decode.py path/to/audio.wav
Optional: Specify a custom model path (default: "audioseal_detector_16bits"):
python utils/decode.py path/to/audio.wav --model_path path/to/custom/detector_model.pth
The script will output:
- Watermark probability (float number)
- Message (16-bit binary vector if watermarked)
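The message is passed as a string of sixteen `0`/`1` characters and reported back as a 16-bit binary vector. The sketch below shows one way to convert between the two forms and reject malformed input. How utils/encode.py actually parses `--message` is not shown here, so the function names `parse_message` and `format_message` are hypothetical.

```python
def parse_message(bits, n_bits=16):
    """Convert a bit string such as "1010101010101010" into a list of 0/1 ints."""
    if len(bits) != n_bits or any(ch not in "01" for ch in bits):
        raise ValueError(f"expected {n_bits} characters, each '0' or '1'")
    return [int(ch) for ch in bits]


def format_message(bit_values):
    """Inverse of parse_message: turn a sequence of 0/1 values into a string."""
    return "".join(str(int(b)) for b in bit_values)
```

Comparing `format_message(...)` of the decoded vector against the original string is a quick check that the message survived the round trip.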
To generate spectrograms for one or more audio files:
python utils/spectogram.py path/to/audio1.wav path/to/audio2.wav
The script will:
- Create a spectogram_files directory if it doesn't exist
- Generate spectrograms for each input audio file
- Save spectrograms as PNG files with the same name as the input files
- Print information about each processed file
The utils/dataset.py script provides utilities for working with the GigaSpeech dataset. It downloads and validates the dataset, and also saves 10 sample WAV files in audio_files:
python utils/dataset.py
Visualize metrics from a single training run.
Basic Usage:
python utils/analyze_train.py <history_json> <metric_name> [options]
Example: Plotting Discriminator Loss
python utils/analyze_train.py dora/xps/a7a7d341/history.json d_loss \
--output output \
--name d_loss_a100 \
--title "Loss for A₁₀₀" \
--xlabel "Training Epochs" \
--ylabel "Discriminator Loss" \
--font-size 14
Available Options:
- --output: Output directory (default: 'outputs')
- --name: Output filename (without extension)
- --title: Plot title (supports Unicode subscripts, e.g., A₁₀₀)
- --xlabel: X-axis label (default: 'Epochs')
- --ylabel: Y-axis label (defaults to metric name)
- --legend: Legend labels (provide two space-separated values for train/val)
- --font-size: Base font size (default: 12)
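Typing Unicode subscripts by hand for `--title` or `--legend` can be awkward. The small helper below builds labels such as A₁₀₀ from a base string and a number. It is a convenience sketch and not part of the repository; the name `with_subscript` is hypothetical.

```python
# Map ASCII digits to their Unicode subscript counterparts (U+2080..U+2089).
_SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def with_subscript(base, number):
    """Return base followed by number rendered in Unicode subscript digits."""
    return base + str(number).translate(_SUBSCRIPT_DIGITS)
```

For example, `with_subscript("A", 100)` yields a string that can be pasted into the `--title` argument.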
Compare the same metric across multiple training runs in a single plot.
Basic Usage:
python utils/analyze_batch_train.py <history_json1> <history_json2> ... <metric_name> [options]
Example: Comparing Discriminator Loss
python utils/analyze_batch_train.py \
dora/xps/6a28e352/history.json \
dora/xps/a7a7d341/history.json \
dora/xps/0427d672/history.json \
d_loss \
--output output \
--name combined_d_loss \
--title "Discriminator Loss" \
--xlabel "Training Epochs" \
--ylabel "Loss" \
--legend "A₁₀" "A₁₀₀" "A₅₀₀₀" \
--font-size 14 \
--line-styles - - - \
--epoch-limits 125 125 50
Additional Options:
- --line-styles: Line styles for each run (e.g., - -- -. for solid, dashed, dash-dot)
- --colors: Custom colors for each run (hex codes)
- --epoch-limits: Maximum epochs to plot for each run (e.g., 125 125 50 for 125 epochs for the first two runs, 50 for the third)
- Other options same as analyze_train.py
Notes:
- Use Unicode subscripts (e.g., A₁₀, A₁₀₀) for clean formatting in titles and legends
- Default font is DejaVu Serif with Times New Roman fallback
- Output is saved as a high-resolution PNG file (300 DPI)
- The script automatically handles both training and validation metrics if available
- Use --epoch-limits to compare models trained for different numbers of epochs
- When using --epoch-limits, make sure the number of limits matches the number of input files
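The requirement that the number of epoch limits match the number of input files can be sketched as follows. The layout of history.json is not described in this README, so the sketch assumes each run has already been reduced to a plain list of per-epoch metric values; `apply_epoch_limits` is a hypothetical name, not a function from utils/analyze_batch_train.py.

```python
def apply_epoch_limits(runs, epoch_limits=None):
    """Truncate each run's per-epoch values to its epoch limit.

    runs: list of lists, one list of per-epoch metric values per run.
    epoch_limits: optional list of ints, one limit per run.
    """
    if epoch_limits is None:
        # No limits given: plot every epoch of every run.
        return [list(run) for run in runs]
    if len(epoch_limits) != len(runs):
        raise ValueError(
            f"got {len(epoch_limits)} epoch limits for {len(runs)} runs"
        )
    return [list(run[:limit]) for run, limit in zip(runs, epoch_limits)]
```

With three runs and limits `[125, 125, 50]`, the first two runs are cut to 125 epochs and the third to 50, which mirrors the example command.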
The utils/visqol_stats.py script calculates statistics for Visqol scores from audio mark evaluation results.
python utils/visqol_stats.py <input_file>
Example:
python utils/visqol_stats.py eval_results/8khz_10hrs_125epochs.txt
The script will display the following statistics for the Visqol scores:
- Number of samples
- Maximum score
- Minimum score
- Mean score
- Median score
- Standard deviation
Notes:
- The script automatically detects the Visqol score column (case-insensitive)
- Values are truncated to 3 decimal places
- Handles various error cases (file not found, empty file, invalid format)
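The statistics listed above can be computed with Python's standard library, as in the sketch below. It is not the code of utils/visqol_stats.py: the function names are hypothetical, the input is assumed to be an already parsed list of floats, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import math
import statistics


def truncate(value, places=3):
    """Truncate (not round) a float to the given number of decimal places."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def summarize_scores(scores):
    """Return count, max, min, mean, median, and std for a list of scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    return {
        "count": len(scores),
        "max": truncate(max(scores)),
        "min": truncate(min(scores)),
        "mean": truncate(statistics.mean(scores)),
        "median": truncate(statistics.median(scores)),
        # Assumption: sample standard deviation, which needs two or more values.
        "std": truncate(statistics.stdev(scores)) if len(scores) > 1 else 0.0,
    }
```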
Replace [audiocraft root]/config with our config folder. It contains the hyperparameters needed for training at an 8 kHz sampling rate.
# cd to phone-audio-mark root
cd ..
# copy config to audiocraft root
cp -r config/* audiocraft/config/
Generate the GigaSpeech manifest for AudioCraft:
python prepare.py --size xs --output audiocraft/gigaspeech
Options:
- --size: Size of the GigaSpeech dataset ('xs', 's', 'm', 'l', 'xl')
- --output: Name of the output JSONL file (without extension)
The script will:
- Load the GigaSpeech dataset from HuggingFace
- Create a JSONL file in the audiocraft/gigaspeech/ directory
- Format each entry with required AudioCraft fields:
  - path: Path to audio file
  - duration: Audio duration in seconds
  - sample_rate: Sampling rate
  - amplitude: null
  - weight: null
  - info_path: null
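A manifest entry with the fields listed above can be sketched in a few lines. This is an illustration of the JSONL layout only, not the code of prepare.py; the helper names `make_manifest_entry` and `write_manifest` are hypothetical.

```python
import json


def make_manifest_entry(path, duration, sample_rate):
    """Build one manifest entry with the fields listed above."""
    return {
        "path": path,
        "duration": duration,
        "sample_rate": sample_rate,
        # These three fields are left empty; None serializes to JSON null.
        "amplitude": None,
        "weight": None,
        "info_path": None,
    }


def write_manifest(entries, jsonl_path):
    """Write one JSON object per line (JSONL)."""
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
```

Each line of the resulting file is an independent JSON object, so the manifest can be inspected or filtered line by line.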
Create the following datasource definition in [audiocraft root]/config/dset/audio/gigaspeech.yaml:
# @package __global__
datasource:
  max_sample_rate: 16000
  max_channels: 1
  train: gigaspeech
  valid: gigaspeech
  evaluate: gigaspeech
  generate: gigaspeech
By default, checkpoints and inference files are saved in /tmp/audiocraft_$USER/outputs. However, to make our checkpoints more accessible, it is better to set a custom path.
Create the following config definition in [audiocraft root]/my_config.yaml:
# File name: my_config.yaml
default:
  dora_dir: /root/phone-audio-mark/dora
  partitions:
    global: your_slurm_partitions
    team: your_slurm_partitions
  reference_dir: /root/phone-audio-mark/dora/reference
Then launch training with this config:
AUDIOCRAFT_CONFIG=my_config.yaml dora run solver=watermark/robustness dset=audio/gigaspeech
To train using multiple GPUs, use the following command:
torchrun --master-addr $(hostname -I | awk '{print $1}') --master-port 29500 --node_rank 0 --nnodes 1 --nproc-per-node 8 -m dora run solver=watermark/robustness dset=audio/gigaspeech_8khz_xl_half
Adjust --nproc-per-node to match the number of available GPUs. If you run into "opening too many files" errors, the most likely cause is the wandb artifacts, so we recommend uninstalling wandb via pip uninstall wandb.
Create virtual environment
conda create -n audiomarkbench python=3.10 -y
conda activate audiomarkbench
Clone AudioMarkBench
git clone https://github.com/moyangkuo/AudioMarkBench/
Install requirements (skipping over uninstallable packages)
while IFS= read -r pkg; do
  echo "Installing $pkg…"
  pip install "$pkg" || echo " → Skipped $pkg"
done < requirements.txt
# macOS (Homebrew)
brew install bazelisk
# Verify that it picks the version in .bazelversion
bazel version
git clone https://github.com/google/visqol.git
cd visqol
export PYTHON_BIN_PATH="$(which python)"
bazel clean --expunge
bazel build --action_env=PYTHON_BIN_PATH -c opt \
//python:visqol_lib_py.so \
//:similarity_result_py_pb2 \
//:visqol_config_py_pb2
pip install -e .
cd <location_of_audiomarkbench>/AudioMarkBench/no-box
# 1) Copy the built Python package
cp -R ../../visqol/bazel-bin/python/visqol ./visqol
# 2) Ensure init files exist
touch visqol/__init__.py
touch visqol/pb2/__init__.py
.
├── AudioMarkBench/
│   └── no-box/
│       ├── nobox_audioseal_audiomarkdata.py
│       └── visqol/                  ← copied package
│           ├── __init__.py
│           ├── visqol_lib_py.so     ← native extension
│           └── pb2/
│               ├── __init__.py
│               ├── similarity_result_pb2.py
│               └── visqol_config_pb2.py
Run the nobox_audioseal_audiomarkdata.py script from your project’s root directory. Below is a complete example that:
- Encodes and decodes 2,000 test samples
- Uses batches of 50
- Resamples everything to 8 kHz
- Saves the perturbed outputs
- Applies an MP3 perturbation at 16 kbps
- Tags the model outputs with the prefix 8khz_100hrs_epoch125
python nobox_audioseal_audiomarkdata.py \
--encode \
--testset_size 2000 \
--batch_size 50 \
--save_pert \
--resample_rate 8000 \
--model_prefix 8khz_100hrs_epoch125 \
--common_perturbation mp3 \
--mp3_bitrates 16

| Flag | Type | Default | Description |
|---|---|---|---|
| --encode | store_true | False | Run the encoding step before decoding. |
| --testset_size <int> | integer | 100 | Number of test samples to process. |
| --batch_size <int> | integer | 100 | Number of samples to process in each batch. |
| --save_pert | store_true | False | If set, save each perturbed audio file to disk. |
| --resample_rate, -sr <int> | integer | 16000 | Target sample rate (Hz) for all audio I/O (e.g. use 8000 for phone-quality audio). |
| --model_prefix <str> | string | '' | Prefix to tag model output files (e.g. experiment name or epoch identifier). |
| --common_perturbation <str> | string | '' | Perturbation type to apply. Options: time_stretch, gaussian_noise, background_noise, quantization, soundstream, opus, encodec, lowpass, highpass, echo, mp3, smooth. |
| --mp3_bitrates <ints> | list of integers | [8, 16] | One or more MP3 bitrates (kbps) to try when --common_perturbation mp3 is used. |
| --gpu <int> | integer | 0 | CUDA GPU index to use (if you have a compatible GPU and CUDA installed). |
| --max_length <int> | integer | 5*16000 | Maximum audio length to load (in samples). Defaults to 5 seconds at 16 kHz. |
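For readers who want to see how the flags and defaults in the table fit together, here is an `argparse` sketch reconstructed purely from the table. It is not copied from nobox_audioseal_audiomarkdata.py, so the actual script may declare its arguments differently; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flag table (reconstruction, not the original)."""
    p = argparse.ArgumentParser(description="Sketch of the documented flags")
    p.add_argument("--encode", action="store_true")
    p.add_argument("--testset_size", type=int, default=100)
    p.add_argument("--batch_size", type=int, default=100)
    p.add_argument("--save_pert", action="store_true")
    p.add_argument("--resample_rate", "-sr", type=int, default=16000)
    p.add_argument("--model_prefix", type=str, default="")
    p.add_argument("--common_perturbation", type=str, default="")
    # nargs="+" accepts one or more bitrates, e.g. "--mp3_bitrates 8 16".
    p.add_argument("--mp3_bitrates", type=int, nargs="+", default=[8, 16])
    p.add_argument("--gpu", type=int, default=0)
    p.add_argument("--max_length", type=int, default=5 * 16000)
    return p
```

Parsing the example command's arguments with `build_parser().parse_args([...])` gives `resample_rate=8000` and `mp3_bitrates=[16]`, while omitted flags fall back to the defaults in the table.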
First, clone the AudioSeal repository and install required dependencies:
git clone https://github.com/facebookresearch/audioseal.git
pip install fire # Required for checkpoint conversion
Convert your trained model checkpoint to the inference format:
python audioseal/src/scripts/checkpoints.py \
--checkpoint=/path/to/checkpoint_50.th \
--outdir=model_outputs \
--suffix=model_name
Use the converted model to add a watermark to an audio file:
python utils/encode.py \
--input_path path/to/input.wav \
--message "1010101010101010" \
--sample_rate 8000 \
--output_path output.wav \
--model_path model_outputs/checkpoint_generator_model_name.pth
Check if the watermark was successfully embedded:
python utils/decode.py output.wav --model_path model_outputs/checkpoint_detector_model_name.pth