ryos17/phone-audio-mark


Phone Quality Audio Watermarking

This repository contains the audio watermarking experiment code for our CS224S final project. All code for the class was uploaded by the deadline; we will keep updating the repository and README as needed so that our partner, Sanas, can easily use our scripts and inspect the data.

Setup

  1. Install Conda if you haven't already:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
  2. Create and activate the conda environment:
conda env create -f environment.yml
conda activate audio-marking

If you are training a model, also complete the following steps:

  1. Install FFmpeg (required by AudioCraft):
conda install "ffmpeg<5" -c conda-forge
  2. Clone the AudioCraft repository (for AudioSeal training):
git clone https://github.com/facebookresearch/audiocraft.git
cd audiocraft
pip install -e .
cd ..

Project Structure

  • environment.yml: Conda environment configuration
  • .gitignore: Git ignore rules
  • README.md: This file
  • utils/dataset.py: Script to download and validate the GigaSpeech dataset
  • utils/encode.py: Script to add watermark to audio files
  • utils/decode.py: Script to detect watermark in audio files
  • utils/spectogram.py: Script to generate and save spectrograms of audio files
  • utils/analyze.py: Script to check audio file metadata (sample rate, channels, duration, format)

Usage

Analyzing Audio Files

To check the metadata of an audio file (sample rate, channels, duration, and format):

python utils/analyze.py path/to/audio.wav

The script will display:

  • Sample rate in Hz
  • Number of channels
  • Duration in seconds
  • Audio file format
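
The same four fields can be read from a PCM WAV file with Python's standard library alone. The sketch below is not the code in utils/analyze.py; describe_wav and write_silence are hypothetical helper names, and the format is simply inferred from the file extension.

```python
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic metadata for a PCM WAV file (hypothetical helper)."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate,
        # Assumption: the format is taken from the file extension.
        "format": Path(path).suffix.lstrip(".").lower(),
    }


def write_silence(path, sample_rate=8000, seconds=1.0):
    """Write a mono 16-bit WAV of silence, handy for trying describe_wav."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "silence.wav"
        write_silence(demo, sample_rate=8000, seconds=2.0)
        print(describe_wav(demo))
```

The wave module only handles uncompressed WAV, so other formats would need a library such as soundfile or torchaudio.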

Adding Watermark to Audio

To add a watermark to an audio file:

python utils/encode.py --input_path path/to/audio.wav --sample_rate 16000

Optional: Add a 16-bit message to the watermark:

python utils/encode.py --input_path path/to/audio.wav --message "1010101010101010"
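
A message string like the one above has to be turned into 16 individual bits before it can be embedded. The following is a minimal sketch of that validation step; parse_message is a hypothetical helper, and utils/encode.py may handle the conversion differently.

```python
def parse_message(bits, n_bits=16):
    """Convert a string such as "1010101010101010" into a list of 0/1 integers."""
    if len(bits) != n_bits:
        raise ValueError(f"expected {n_bits} bits, got {len(bits)}")
    if set(bits) - {"0", "1"}:
        raise ValueError("message may only contain the characters 0 and 1")
    return [int(b) for b in bits]


if __name__ == "__main__":
    print(parse_message("1010101010101010"))
```

Rejecting malformed input early avoids silently embedding a message of the wrong length.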

Optional: Specify custom output path:

python utils/encode.py --input_path path/to/audio.wav --output_path path/to/output.wav

Optional: Use a custom model (default: "audioseal_wm_16bits"):

python utils/encode.py --input_path path/to/audio.wav --model_path path/to/custom/generator_model.pth

Detecting Watermark

To detect a watermark in an audio file:

python utils/decode.py path/to/audio.wav

Optional: Specify a custom model path (default: "audioseal_detector_16bits"):

python utils/decode.py path/to/audio.wav --model_path path/to/custom/detector_model.pth

The script will output:

  • Watermark probability (float number)
  • Message (16-bit binary vector if watermarked)
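
Once the detector reports a probability and a decoded message, a common follow-up is to compare the decoded bits with the bits that were embedded. The sketch below shows one way to do that; bit_accuracy and is_watermarked are hypothetical helpers, and the 0.5 threshold is an assumption rather than a value taken from utils/decode.py.

```python
def bit_accuracy(sent, decoded):
    """Fraction of positions where the decoded bits equal the embedded bits."""
    if len(sent) != len(decoded):
        raise ValueError("messages must have the same length")
    matches = sum(int(a) == int(b) for a, b in zip(sent, decoded))
    return matches / len(sent)


def is_watermarked(probability, threshold=0.5):
    """Turn the detector's probability into a yes/no decision (threshold is assumed)."""
    return probability >= threshold


if __name__ == "__main__":
    # One flipped bit out of 16.
    print(bit_accuracy("1010101010101010", "1010101010101011"))
    print(is_watermarked(0.97))
```

Both strings of "0"/"1" characters and lists of integers are accepted, since each element is converted with int() before comparison.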

Generating Spectrograms

To generate spectrograms for one or more audio files:

python utils/spectogram.py path/to/audio1.wav path/to/audio2.wav

The script will:

  • Create a spectogram_files directory if it doesn't exist
  • Generate spectrograms for each input audio file
  • Save spectrograms as PNG files with the same name as the input files
  • Print information about each processed file

Working with the GigaSpeech Dataset

The utils/dataset.py script provides utilities for working with the GigaSpeech dataset. It downloads and validates the dataset, and also saves 10 sample WAV files in audio_files:

python utils/dataset.py

Training Visualization Tools

1. Single Run Analysis (analyze_train.py)

Visualize metrics from a single training run.

Basic Usage:

python utils/analyze_train.py <history_json> <metric_name> [options]

Example: Plotting Discriminator Loss

python utils/analyze_train.py dora/xps/a7a7d341/history.json d_loss \
    --output output \
    --name d_loss_a100 \
    --title "Loss for A₁₀₀" \
    --xlabel "Training Epochs" \
    --ylabel "Discriminator Loss" \
    --font-size 14

Available Options:

  • --output: Output directory (default: 'outputs')
  • --name: Output filename (without extension)
  • --title: Plot title (supports Unicode subscripts, e.g., A₁₀₀)
  • --xlabel: X-axis label (default: 'Epochs')
  • --ylabel: Y-axis label (defaults to metric name)
  • --legend: Legend labels (provide two space-separated values for train/val)
  • --font-size: Base font size (default: 12)
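
To make the <history_json> and <metric_name> arguments concrete, here is a sketch of how a metric could be pulled out of a training history. It assumes the history is a JSON list with one entry per epoch, each mapping a stage name such as "train" or "valid" to a dictionary of metrics; that layout is an assumption, metric_series is a hypothetical helper, and the numbers are made up for illustration.

```python
import json


def metric_series(history, metric, stage="train"):
    """Collect (epoch, value) pairs for one metric from a list of per-epoch dicts."""
    series = []
    for epoch, entry in enumerate(history, start=1):
        value = entry.get(stage, {}).get(metric)
        if value is not None:  # a stage may be missing for some epochs
            series.append((epoch, value))
    return series


# Stand-in for the parsed contents of a history.json file; values are illustrative only.
example_history = json.loads(
    '[{"train": {"d_loss": 1.9}, "valid": {"d_loss": 2.1}},'
    ' {"train": {"d_loss": 1.4}}]'
)

if __name__ == "__main__":
    print(metric_series(example_history, "d_loss", stage="train"))
    print(metric_series(example_history, "d_loss", stage="valid"))
```

Skipping epochs where a stage is absent keeps the train and validation curves aligned to their true epoch numbers.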

2. Multi-Run Comparison (analyze_batch_train.py)

Compare the same metric across multiple training runs in a single plot.

Basic Usage:

python utils/analyze_batch_train.py <history_json1> <history_json2> ... <metric_name> [options]

Example: Comparing Discriminator Loss

python utils/analyze_batch_train.py \
    dora/xps/6a28e352/history.json \
    dora/xps/a7a7d341/history.json \
    dora/xps/0427d672/history.json \
    d_loss \
    --output output \
    --name combined_d_loss \
    --title "Discriminator Loss" \
    --xlabel "Training Epochs" \
    --ylabel "Loss" \
    --legend "A₁₀" "A₁₀₀" "A₅₀₀₀" \
    --font-size 14 \
    --line-styles - - - \
    --epoch-limits 125 125 50

Additional Options:

  • --line-styles: Line styles for each run (e.g., - -- -. for solid, dashed, dash-dot)
  • --colors: Custom colors for each run (hex codes)
  • --epoch-limits: Maximum epochs to plot for each run (e.g., 125 125 50 for 125 epochs for first two runs, 50 for third)
  • Other options same as analyze_train.py

Notes:

  • Training and validation metrics are both handled automatically when available
  • Use Unicode subscripts (e.g., A₁₀, A₁₀₀) for clean formatting in titles and legends
  • The default font is DejaVu Serif, with Times New Roman as a fallback
  • Output is saved as a high-resolution PNG file (300 DPI)
  • Use --epoch-limits to compare models trained for different numbers of epochs
  • When using --epoch-limits, make sure the number of limits matches the number of input files
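
The requirement that the number of epoch limits match the number of input files can be sketched as a simple truncation step. apply_epoch_limits is a hypothetical helper written for illustration, not the code in utils/analyze_batch_train.py.

```python
def apply_epoch_limits(runs, limits=None):
    """Truncate each run's metric values to its own maximum number of epochs."""
    if limits is None:
        return [list(run) for run in runs]
    if len(limits) != len(runs):
        raise ValueError(f"got {len(limits)} limits for {len(runs)} runs")
    return [list(run)[:limit] for run, limit in zip(runs, limits)]


if __name__ == "__main__":
    # Three runs of different lengths, cut to 125, 125 and 50 epochs.
    runs = [range(200), range(125), range(80)]
    print([len(r) for r in apply_epoch_limits(runs, [125, 125, 50])])
```

A limit larger than a run's length leaves that run unchanged, because slicing past the end of a list is harmless in Python.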

ViSQOL Score Analysis

The utils/visqol_stats.py script calculates statistics for ViSQOL scores from audio watermark evaluation results.

Usage

python utils/visqol_stats.py <input_file>

Example

python utils/visqol_stats.py eval_results/8khz_10hrs_125epochs.txt

Output

The script will display the following statistics for the ViSQOL scores:

  • Number of samples
  • Maximum score
  • Minimum score
  • Mean score
  • Median score
  • Standard deviation

Notes

  • The script automatically detects the ViSQOL score column (case-insensitive)
  • Values are truncated to 3 decimal places
  • Handles various error cases (file not found, empty file, invalid format)
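
The statistics listed above, together with truncation to 3 decimal places, can be sketched with the standard library. This is an illustration rather than the code in utils/visqol_stats.py: truncate and summarize are hypothetical helpers, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def truncate(value, places=3):
    """Drop digits beyond the given number of decimal places without rounding."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def summarize(scores):
    """Compute count, max, min, mean, median and standard deviation of the scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    return {
        "count": len(scores),
        "max": truncate(max(scores)),
        "min": truncate(min(scores)),
        "mean": truncate(statistics.mean(scores)),
        "median": truncate(statistics.median(scores)),
        # Assumption: sample standard deviation, reported as 0.0 for a single score.
        "stdev": truncate(statistics.stdev(scores)) if len(scores) > 1 else 0.0,
    }


if __name__ == "__main__":
    print(summarize([4.0, 4.5, 3.5, 4.25]))
```

Truncating differs from rounding: a mean of 4.0625 is reported as 4.062 rather than 4.063.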

Training

1. Configure AudioCraft Training Parameters

Replace the contents of [audiocraft root]/config with our config folder, which contains the hyperparameters needed to train at an 8 kHz sampling rate.

# cd to phone-audio-mark root
cd ..

# copy config to audiocraft root
cp -r config/* audiocraft/config/

2. Preparing GigaSpeech Dataset for Training

python prepare.py --size xs --output audiocraft/gigaspeech

Options:

  • --size: Size of the GigaSpeech dataset ('xs', 's', 'm', 'l', 'xl')
  • --output: Name of the output JSONL file (without extension)

The script will:

  • Load the GigaSpeech dataset from HuggingFace
  • Create a JSONL file in audiocraft/gigaspeech/ directory
  • Format each entry with required AudioCraft fields:
    • path: Path to audio file
    • duration: Audio duration in seconds
    • sample_rate: Sampling rate
    • amplitude: null
    • weight: null
    • info_path: null
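
The entry format above maps directly onto one JSON object per line. The sketch below builds such a line with the standard library; manifest_entry and to_jsonl are hypothetical helpers, the example path is a placeholder, and prepare.py may construct its output differently.

```python
import json


def manifest_entry(path, duration, sample_rate):
    """Build one manifest record with the fields listed above."""
    return {
        "path": str(path),
        "duration": float(duration),
        "sample_rate": int(sample_rate),
        "amplitude": None,  # None is serialized as JSON null
        "weight": None,
        "info_path": None,
    }


def to_jsonl(entries):
    """Serialize entries as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(entry) + "\n" for entry in entries)


if __name__ == "__main__":
    # Placeholder path and values, for illustration only.
    print(to_jsonl([manifest_entry("audio/example.wav", 3.5, 16000)]), end="")
```

Each line can be parsed independently with json.loads, which is what makes JSONL convenient for large datasets.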

3. Configure AudioCraft Dataset Format

Create the following datasource definition in [audiocraft root]/config/dset/audio/gigaspeech.yaml:

# @package __global__

datasource:
  max_sample_rate: 16000
  max_channels: 1

  train: gigaspeech
  valid: gigaspeech
  evaluate: gigaspeech
  generate: gigaspeech

4. Configure Dora Path

By default, checkpoints and inference files are saved in /tmp/audiocraft_$USER/outputs. To make the checkpoints easier to access, set a custom path instead.

Create the following config definition in [audiocraft root]/my_config.yaml:

# File name: my_config.yaml

default:
  dora_dir: /root/phone-audio-mark/dora
  partitions:
    global: your_slurm_partitions
    team: your_slurm_partitions
  reference_dir: /root/phone-audio-mark/dora/reference

5. Run Training

AUDIOCRAFT_CONFIG=my_config.yaml dora run solver=watermark/robustness dset=audio/gigaspeech

Multi-GPU Training

To train using multiple GPUs, use the following command:

torchrun \
    --master-addr $(hostname -I | awk '{print $1}') \
    --master-port 29500 \
    --node_rank 0 \
    --nnodes 1 \
    --nproc-per-node 8 \
    -m dora run \
    solver=watermark/robustness \
    dset=audio/gigaspeech_8khz_xl_half

Adjust --nproc-per-node to match the number of available GPUs. If you run into "opening too many files" errors, the wandb artifacts are the most likely cause, so we recommend uninstalling wandb with pip uninstall wandb.

Evaluation

Create virtual environment

conda create -n audiomarkbench python=3.10 -y
conda activate audiomarkbench

Clone AudioMarkBench

git clone https://github.com/moyangkuo/AudioMarkBench/

Install requirements (skipping over uninstallable packages)

while IFS= read -r pkg; do
  echo "Installing $pkg"
  pip install "$pkg" || echo "  → Skipped $pkg"
done < requirements.txt

VisQOL Setup

Install Bazelisk

# macOS (Homebrew)
brew install bazelisk

# Verify that it picks the version in .bazelversion
bazel version

Clone the ViSQOL repository

git clone https://github.com/google/visqol.git
cd visqol

Point Bazel at your Python interpreter

export PYTHON_BIN_PATH="$(which python)"

Clean any previous outputs and build the required packages

bazel clean --expunge
bazel build --action_env=PYTHON_BIN_PATH -c opt \
    //python:visqol_lib_py.so \
    //:similarity_result_py_pb2 \
    //:visqol_config_py_pb2

Install using pip

pip install -e .

To create the correct folder structure that AudioMarkBench expects:

cd <location_of_audiomarkbench>/AudioMarkBench/no-box

# 1) Copy the built Python package
cp -R ../../visqol/bazel-bin/python/visqol ./visqol

# 2) Ensure init files exist
touch visqol/__init__.py
touch visqol/pb2/__init__.py

Finally, your folder structure should look like this:

.
└── AudioMarkBench/
    └── no-box/
        ├── nobox_audioseal_audiomarkdata.py
        └── visqol/                    ← copied package
            ├── __init__.py
            ├── visqol_lib_py.so       ← native extension
            └── pb2/
                ├── __init__.py
                ├── similarity_result_pb2.py
                └── visqol_config_pb2.py

Usage

Run the nobox_audioseal_audiomarkdata.py script from your project’s root directory. Below is a complete example that:

  • Encodes and decodes 2,000 test samples
  • Uses batches of 50
  • Resamples everything to 8 kHz
  • Saves the perturbed outputs
  • Applies an MP3 perturbation at 16 kbps
  • Tags the model outputs with the prefix 8khz_100hrs_epoch125

python nobox_audioseal_audiomarkdata.py \
  --encode \
  --testset_size 2000 \
  --batch_size 50 \
  --save_pert \
  --resample_rate 8000 \
  --model_prefix 8khz_100hrs_epoch125 \
  --common_perturbation mp3 \
  --mp3_bitrates 16

Command-line Arguments

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --encode | store_true | False | Run the encoding step before decoding. |
| --testset_size <int> | integer | 100 | Number of test samples to process. |
| --batch_size <int> | integer | 100 | Number of samples to process in each batch. |
| --save_pert | store_true | False | If set, save each perturbed audio file to disk. |
| --resample_rate, -sr <int> | integer | 16000 | Target sample rate (Hz) for all audio I/O (e.g. use 8000 for phone-quality audio). |
| --model_prefix <str> | string | '' | Prefix to tag model output files (e.g. experiment name or epoch identifier). |
| --common_perturbation <str> | string | '' | Perturbation type to apply. Options: time_stretch, gaussian_noise, background_noise, quantization, soundstream, opus, encodec, lowpass, highpass, echo, mp3, smooth. |
| --mp3_bitrates <ints> | list of integers | [8, 16] | One or more MP3 bitrates (kbps) to try when --common_perturbation mp3 is used. |
| --gpu <int> | integer | 0 | CUDA GPU index to use (if you have a compatible GPU and CUDA installed). |
| --max_length <int> | integer | 5*16000 | Maximum audio length to load (in samples). Defaults to 5 seconds at 16 kHz. |
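
For readers who want to see how these flags and defaults fit together, the table can be mirrored with argparse. This is a sketch reconstructed from the table only, not the parser in nobox_audioseal_audiomarkdata.py; build_parser is a hypothetical name and the help strings are paraphrased.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags and defaults in the table above (a sketch)."""
    parser = argparse.ArgumentParser(description="Evaluation flags (illustrative)")
    parser.add_argument("--encode", action="store_true", help="encode before decoding")
    parser.add_argument("--testset_size", type=int, default=100)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--save_pert", action="store_true", help="save perturbed audio")
    parser.add_argument("--resample_rate", "-sr", type=int, default=16000)
    parser.add_argument("--model_prefix", type=str, default="")
    parser.add_argument("--common_perturbation", type=str, default="")
    # nargs="+" accepts one or more bitrates, e.g. --mp3_bitrates 8 16
    parser.add_argument("--mp3_bitrates", type=int, nargs="+", default=[8, 16])
    parser.add_argument("--gpu", type=int, default=0)
    parser.add_argument("--max_length", type=int, default=5 * 16000)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--encode", "--resample_rate", "8000", "--common_perturbation", "mp3",
         "--mp3_bitrates", "16"]
    )
    print(args)
```

Note that the --max_length default stays at 5*16000 samples regardless of --resample_rate, so at 8 kHz the same limit corresponds to 10 seconds of audio.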

Running Inference with Custom Trained Model

1. Prepare the Model

First, clone the AudioSeal repository and install required dependencies:

git clone https://github.com/facebookresearch/audioseal.git
pip install fire  # Required for checkpoint conversion

2. Convert Checkpoint

Convert your trained model checkpoint to the inference format:

python audioseal/src/scripts/checkpoints.py \
    --checkpoint=/path/to/checkpoint_50.th \
    --outdir=model_outputs \
    --suffix=model_name

3. Run Watermarking

Use the converted model to add a watermark to an audio file:

python utils/encode.py \
    --input_path path/to/input.wav \
    --message "1010101010101010" \
    --sample_rate 8000 \
    --output_path output.wav \
    --model_path model_outputs/checkpoint_generator_model_name.pth

4. Verify Watermark

Check if the watermark was successfully embedded:

python utils/decode.py output.wav --model_path model_outputs/checkpoint_detector_model_name.pth
