Machine Unlearning on Spoken Language Understanding

📖 Overview

This repository contains the detailed experimental results for the paper "Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding, accepted at INTERSPEECH 2025.

🔗 Table of Contents

- ⚙️ Experimental Setup
- 📊 Detailed Results
  - 1. Comparison of unlearning methods
  - 2. Best Learning Rate (LR) for Unlearning Methods
- 📜 License
- 📧 Contact
- 📄 Citation

⚙️ Experimental Setup

Transformer models. For the English datasets (FSC and SLURP), we use wav2vec 2.0 and HuBERT models. For ITALIC and SpeechMASSIVE, we use the multilingual XLS-R-128 and XLS-R-53 models, the latter ASR fine-tuned for the target language (Italian, German, or French). On FSC, the models are trained for 2800 steps with a batch size of 8, 10% warmup steps, 4 gradient accumulation steps, an initial learning rate of 5e-4, AdamW with weight decay, a plateau scheduler, and an early stopping criterion. SLURP follows the same configuration, but training lasts for 30000 steps. On ITALIC and SpeechMASSIVE, both the XLS-R-128 and the language-specific XLS-R-53 models are trained with the same configuration, except for 15000 total steps, an initial learning rate of 1e-4, and 1 gradient accumulation step.
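
As a quick sanity check on the numbers above, the following sketch collects the stated training settings in a small dataclass and derives the effective batch size and the number of warmup steps. The class, field, and dictionary names (`TrainConfig`, `CONFIGS`, and so on) are illustrative assumptions and are not taken from the repository code. The sketch also assumes that the 10% warmup is a fraction of the total training steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as stated in the text (all names are hypothetical)."""
    total_steps: int
    learning_rate: float
    grad_accum_steps: int
    batch_size: int = 8          # same for every dataset
    warmup_ratio: float = 0.10   # assumed to be a fraction of total_steps

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer update.
        return self.batch_size * self.grad_accum_steps

    @property
    def warmup_steps(self) -> int:
        return round(self.total_steps * self.warmup_ratio)


CONFIGS = {
    "fsc": TrainConfig(total_steps=2800, learning_rate=5e-4, grad_accum_steps=4),
    "slurp": TrainConfig(total_steps=30000, learning_rate=5e-4, grad_accum_steps=4),
    "italic": TrainConfig(total_steps=15000, learning_rate=1e-4, grad_accum_steps=1),
    "speechmassive": TrainConfig(total_steps=15000, learning_rate=1e-4, grad_accum_steps=1),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: effective batch size {cfg.effective_batch_size}, "
          f"warmup steps {cfg.warmup_steps}")
```

Under these assumptions, FSC and SLURP update on 32 samples per optimizer step, while ITALIC and SpeechMASSIVE update on 8.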

Unlearning methods. All unlearning methods were run for a single epoch with the hyperparameters recommended in their original papers. For example, for CF-k, the entire network was frozen except for the last layer (CF-1). For SCRUB, a temperature of T = 4.0 was used, and for Bad Teaching, the KL temperature was 1. For UNSIR, the learning rate of the noise was set to 0.01. AdamW was used as the optimizer for all methods, for consistency with training.
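
To make the role of the temperature concrete, here is a minimal, dependency-free sketch of a temperature-softened KL divergence between teacher and student logits, the kind of term used by distillation-based unlearning methods such as SCRUB and Bad Teaching. The function names and the example logits are illustrative assumptions. This is not the repository's implementation, which may differ in details such as scaling and reduction.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softened_kl(teacher_logits, student_logits, temperature):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits for illustration only.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]

kl_t4 = softened_kl(teacher, student, temperature=4.0)  # SCRUB setting
kl_t1 = softened_kl(teacher, student, temperature=1.0)  # Bad Teaching setting
print(f"T=4.0: {kl_t4:.4f}  T=1.0: {kl_t1:.4f}")
```

In this example, the higher temperature flattens both distributions, so the same pair of logits yields a smaller divergence at T = 4.0 than at T = 1.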

Datasets. The datasets used in the experiments can be created by running the corresponding notebooks in the unlearning_datasets_creation folder. Once created, each dataset is stored in a data_name folder, where name identifies the dataset: FSC (data_fsc), SLURP* (data_slurp*), ITALIC (data_italic), and SpeechMASSIVE (data_sm-de and data_sm-fr).
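
Before launching an experiment, it can help to check which dataset folders have already been created. The sketch below does this with `pathlib`. The helper name `find_missing_datasets` and the assumption that the folders sit directly under the repository root are hypothetical; the folder names are copied verbatim from the text above, including `data_slurp*`.

```python
from pathlib import Path

# Folder names exactly as listed in the text; "data_slurp*" is copied verbatim.
DATASET_FOLDERS = {
    "FSC": "data_fsc",
    "SLURP*": "data_slurp*",
    "ITALIC": "data_italic",
    "SpeechMASSIVE (DE)": "data_sm-de",
    "SpeechMASSIVE (FR)": "data_sm-fr",
}


def find_missing_datasets(root="."):
    """Return the dataset names whose folder does not exist under `root`."""
    root = Path(root)
    return [name for name, folder in DATASET_FOLDERS.items()
            if not (root / folder).is_dir()]


missing = find_missing_datasets()
if missing:
    print("Not created yet:", ", ".join(missing))
else:
    print("All dataset folders found.")
```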

All experiments are conducted on a single NVIDIA A6000 48GB GPU.

📊 Detailed Results

1. Comparison of unlearning methods

Legend:

Bold = best result
Underlined = second best

A. FSC

Method	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup
	wav2vec 2.0 →							HuBERT →
Orig.	0.994	0.994	1.000	1.000	0.508	0.000	1.00×	0.993	0.991	1.000	1.000	0.511	0.000	1.00×
Gold	0.993	0.992	0.997	0.998	0.503	0.000	1.00×	0.991	0.990	0.996	0.998	0.507	0.000	1.00×
FT	0.993	0.993	0.999	0.999	0.504	0.517	7.96×	0.979	0.979	0.993	0.992	0.508	0.514	7.69×
NG	0.987	0.988	0.976	0.984	0.501	0.816	206.9×	0.992	0.990	0.996	0.996	0.514	0.000	201.1×
NG+	0.994	0.994	0.994	0.996	0.493	0.000	4.03×	0.979	0.983	0.929	0.955	0.510	0.336	3.90×
CF-k	0.994	0.994	1.000	1.000	0.501	0.606	16.97×	0.993	0.991	1.000	1.000	0.505	0.642	26.70×
UNSIR	0.991	0.992	1.000	1.000	0.506	0.447	6.55×	0.994	0.992	0.998	0.998	0.508	0.484	6.38×
BT	0.993	0.994	1.000	1.000	0.508	0.000	4.78×	0.993	0.991	0.999	0.999	0.504	0.363	4.65×
BT-L	0.994	0.994	0.996	0.998	0.506	0.431	5.87×	0.993	0.991	0.997	0.998	0.506	0.464	5.69×
SCRUB	0.994	0.994	1.000	1.000	0.506	0.439	6.21×	0.993	0.992	0.998	0.999	0.508	0.479	6.22×
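
Each row in the tables of this section holds the method name followed by two groups of seven columns, one group per model (for FSC, wav2vec 2.0 first and HuBERT second). The sketch below splits such a tab-separated row into per-model dictionaries, using the FT row above as input. The function name `parse_row` and the dictionary layout are illustrative assumptions, not part of the repository.

```python
METRICS = ["F1_Test", "Acc_Test", "F1_Forget", "Acc_Forget", "MIA", "GUM", "Speedup"]


def parse_row(line, models):
    """Split one tab-separated results row into (method, {model: {metric: value}})."""
    method, *cells = line.rstrip("\n").split("\t")
    expected = len(METRICS) * len(models)
    if len(cells) != expected:
        raise ValueError(f"expected {expected} values, got {len(cells)}")
    # Drop the trailing multiplication sign on Speedup before converting.
    values = [float(cell.rstrip("×")) for cell in cells]
    per_model = {}
    for i, model in enumerate(models):
        chunk = values[i * len(METRICS):(i + 1) * len(METRICS)]
        per_model[model] = dict(zip(METRICS, chunk))
    return method, per_model


# FT row of the FSC table, copied from above.
row = ("FT\t0.993\t0.993\t0.999\t0.999\t0.504\t0.517\t7.96×"
       "\t0.979\t0.979\t0.993\t0.992\t0.508\t0.514\t7.69×")
method, results = parse_row(row, ["wav2vec 2.0", "HuBERT"])
print(method, results["wav2vec 2.0"]["GUM"], results["HuBERT"]["GUM"])
```

The same function applies to the other tables by passing the corresponding pair of model names, for example `["XLS-R 128", "XLS-R 53-IT"]` for ITALIC.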

B. SLURP*

Method	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup
	wav2vec 2.0 →							HuBERT →
Orig.	0.689	0.815	1.000	0.999	0.628	0.000	1.000×	0.712	0.830	1.000	1.000	0.613	0.000	1.000×
Gold	0.707	0.825	0.711	0.822	0.506	0.000	1.000×	0.704	0.826	0.715	0.821	0.492	0.000	1.000×
FT	0.638	0.750	0.970	0.989	0.648	0.000	83.78×	0.734	0.827	1.000	1.000	0.611	0.088	79.00×
NG	0.695	0.809	0.986	0.988	0.604	0.563	1748×	0.718	0.830	0.959	0.993	0.587	0.587	1654×
NG+	0.701	0.810	0.995	0.993	0.603	0.446	41.63×	0.630	0.777	0.852	0.913	0.453	0.578	39.30×
CF-k	0.709	0.818	1.000	1.000	0.626	0.089	291.9×	0.715	0.831	1.000	1.000	0.608	0.196	274.2×
UNSIR	0.673	0.801	1.000	1.000	0.637	0.000	64.07×	0.722	0.832	1.000	1.000	0.613	0.000	60.44×
BT	0.710	0.815	0.999	0.999	0.619	0.275	50.35×	0.711	0.830	1.000	1.000	0.613	0.000	47.42×
BT-L	0.680	0.811	0.995	0.998	0.637	0.000	61.74×	0.685	0.792	0.907	0.954	0.558	0.578	58.11×
SCRUB	0.697	0.824	0.999	0.999	0.608	0.429	64.82×	0.704	0.832	1.000	1.000	0.600	0.350	65.40×

C. ITALIC

Method	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup
	XLS-R 128 →							XLS-R 53-IT →
Orig.	0.688	0.735	0.894	0.965	0.632	0.000	1.000×	0.778	0.837	1.000	1.000	0.615	0.000	1.000×
Gold	0.643	0.709	0.568	0.631	0.532	0.000	1.000×	0.784	0.835	0.736	0.824	0.478	0.000	1.000×
FT	0.638	0.686	0.671	0.739	0.555	0.590	30.80×	0.711	0.778	0.850	0.899	0.550	0.551	31.10×
NG	0.679	0.728	0.868	0.933	0.603	0.646	613.4×	0.590	0.688	0.621	0.767	0.525	0.766	623.0×
NG+	0.658	0.710	0.001	0.007	0.932	0.000	15.14×	0.743	0.800	0.936	0.950	0.582	0.418	15.37×
CF-k	0.677	0.737	0.871	0.963	0.626	0.253	98.59×	0.781	0.838	1.000	1.000	0.609	0.201	98.99×
UNSIR	0.636	0.701	0.830	0.925	0.621	0.328	22.01×	0.775	0.838	1.000	1.000	0.612	0.109	22.26×
BT	0.683	0.734	0.639	0.720	0.481	0.504	17.90×	0.731	0.803	0.848	0.903	0.557	0.491	17.94×
BT-L	0.686	0.734	0.651	0.749	0.518	0.558	22.02×	0.729	0.801	0.876	0.913	0.564	0.499	22.21×
SCRUB	0.442	0.464	0.357	0.377	0.533	0.536	23.25×	0.770	0.809	0.990	0.982	0.610	0.164	22.66×

D. SpeechMASSIVE (DE)

Method	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup
	XLS-R 128 →							XLS-R 53-DE →
Orig.	0.584	0.681	0.841	0.938	0.621	0.000	1.000×	0.778	0.804	1.000	1.000	0.622	0.000	1.000×
Gold	0.566	0.672	0.529	0.674	0.513	0.000	1.000×	0.745	0.795	0.706	0.818	0.493	0.000	1.000×
FT	0.498	0.579	0.548	0.694	0.543	0.588	34.34×	0.661	0.729	0.905	0.938	0.585	0.464	17.79×
NG	0.550	0.651	0.726	0.818	0.562	0.797	1078×	0.764	0.796	0.957	0.985	0.587	0.643	558.7×
NG+	0.540	0.635	0.567	0.700	0.487	0.522	16.89×	0.759	0.789	0.878	0.944	0.568	0.431	8.770×
CF-k	0.587	0.682	0.865	0.941	0.622	0.000	109.9×	0.777	0.803	1.000	1.000	0.616	0.208	56.93×
UNSIR	0.565	0.649	0.788	0.924	0.616	0.197	27.46×	0.785	0.804	1.000	1.000	0.619	0.114	14.23×
BT	0.584	0.682	0.789	0.912	0.582	0.489	20.02×	0.726	0.783	0.945	0.976	0.585	0.418	10.41×
BT-L	0.584	0.681	0.786	0.912	0.576	0.523	24.87×	0.729	0.784	0.948	0.979	0.587	0.434	12.94×
SCRUB	0.584	0.682	0.780	0.918	0.600	0.429	26.86×	0.781	0.800	1.000	1.000	0.615	0.211	13.43×

E. SpeechMASSIVE (FR)

Method	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup	F1_Test	Acc_Test	F1_Forget	Acc_Forget	MIA	GUM	Speedup
	XLS-R 128 →							XLS-R 53-FR →
Orig.	0.410	0.543	0.572	0.733	0.629	0.000	1.000×	0.756	0.815	1.000	1.000	0.635	0.000	1.000×
Gold	0.469	0.618	0.460	0.580	0.509	0.000	1.000×	0.772	0.807	0.800	0.825	0.520	0.000	1.000×
FT	0.400	0.527	0.465	0.589	0.539	0.545	18.12×	0.759	0.816	0.974	0.997	0.627	0.255	18.42×
NG	0.317	0.438	0.349	0.491	0.564	0.749	597.3×	0.768	0.815	0.935	0.979	0.617	0.501	610.2×
NG+	0.382	0.501	0.008	0.028	0.882	0.000	8.900×	0.759	0.807	0.943	0.982	0.620	0.317	9.230×
CF-k	0.436	0.551	0.594	0.767	0.612	0.414	58.23×	0.770	0.815	1.000	1.000	0.624	0.338	58.86×
UNSIR	0.420	0.548	0.591	0.755	0.620	0.259	14.67×	0.768	0.815	1.000	1.000	0.633	0.089	14.94×
BT	0.411	0.544	0.583	0.742	0.597	0.409	10.60×	0.772	0.816	0.981	0.994	0.621	0.317	10.82×
BT-L	0.412	0.543	0.574	0.739	0.591	0.447	13.18×	0.727	0.789	0.981	0.985	0.623	0.306	13.42×
SCRUB	0.409	0.539	0.532	0.702	0.611	0.358	13.68×	0.769	0.814	1.000	1.000	0.633	0.089	13.94×

2. Best Learning Rate (LR) for Unlearning Methods

The following table shows the best LR value that maximizes the Global Unlearning Metric (GUM) proposed in the paper for each method, dataset, and model.

Method	fsc HuBERT	fsc wav2vec 2.0	slurp HuBERT	slurp wav2vec 2.0	italic XLS-R 128	italic XLS-R 53-IT	sm-de XLS-R 128	sm-de XLS-R 53-DE	sm-fr XLS-R 128	sm-fr XLS-R 53-FR
`FT`	0.0001	1e-05	1e-05	0.0001	0.0001	0.0001	0.0001	0.0001	0.0001	1e-05
`NG`	1e-06	5e-06	5e-06	5e-06	1e-06	5e-06	5e-06	5e-06	5e-06	5e-06
`NG+`	1e-06	1e-06	5e-06	1e-06	1e-06	5e-07	5e-07	1e-06	1e-06	5e-07
`CF-k`	5e-05	0.0001	0.0001	5e-05	0.0001	0.0001	0.0001	1e-05	0.0001	5e-05
`UNSIR`	1e-05	0.0001	1e-05	0.0001	0.0001	1e-05	5e-05	1e-05	1e-05	1e-05
`BT`	1e-06	1e-06	1e-06	5e-06	1e-06	1e-06	1e-06	1e-06	1e-06	5e-06
`BT-L`	1e-06	5e-07	1e-06	1e-06	1e-06	1e-06	1e-06	1e-06	1e-06	1e-06
`SCRUB`	5e-06	1e-06	5e-06	5e-06	5e-06	5e-06	5e-06	5e-06	5e-06	5e-07
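
The selection rule behind this table can be sketched as an argmax over a learning-rate sweep: for a given method, dataset, and model, keep the LR whose run reaches the highest GUM. The GUM values below are made-up placeholders for illustration only and are not results from the paper; the function name `best_lr` is likewise an assumption.

```python
def best_lr(gum_by_lr):
    """Return the learning rate with the highest GUM.

    On ties, the first learning rate in insertion order wins.
    """
    if not gum_by_lr:
        raise ValueError("empty learning-rate sweep")
    return max(gum_by_lr, key=gum_by_lr.get)


# Placeholder sweep for one (method, dataset, model) triple; values are NOT from the paper.
sweep = {1e-4: 0.10, 1e-5: 0.35, 5e-6: 0.42, 1e-6: 0.30}
print(best_lr(sweep))  # prints 5e-06
```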

For further details, please refer to the accompanying paper.

📜 License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

📧 Contact

For any inquiries or feedback, please contact Alkis Koudounas and Claudio Savelli.

📄 Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{koudounas2025unlearning,
  title={"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding},
  author={Koudounas, Alkis and Savelli, Claudio and Giobergia, Flavio and Baralis, Elena},
  booktitle={Proc. Interspeech 2025}, 
  year={2025},
}
